Timmy-time-dashboard/docs/research/autoresearch-h1-baseline.md

# Autoresearch H1 — M3 Max Baseline

**Status:** Baseline established (Issue #905)
**Hardware:** Apple M3 Max · 36 GB unified memory
**Date:** 2026-03-23
**Refs:** #905 · #904 (parent) · #881 (M3 Max compute) · #903 (MLX benchmark)

---

## Setup

### Prerequisites

```bash
# Install MLX (Apple Silicon — definitively faster than llama.cpp per #903)
pip install mlx mlx-lm

# Install project deps
tox -e dev  # or: pip install -e '.[dev]'
```

### Clone & prepare

`prepare_experiment` in `src/timmy/autoresearch.py` handles the clone.
On Apple Silicon it automatically sets `AUTORESEARCH_BACKEND=mlx` and
`AUTORESEARCH_DATASET=tinystories`.

```python
from timmy.autoresearch import prepare_experiment
status = prepare_experiment("data/experiments", dataset="tinystories", backend="auto")
print(status)
```

Or via the dashboard: `POST /experiments/start` (requires `AUTORESEARCH_ENABLED=true`).

### Configuration (`.env` / environment)

```
AUTORESEARCH_ENABLED=true
AUTORESEARCH_DATASET=tinystories   # lower-entropy dataset, faster iteration on Mac
AUTORESEARCH_BACKEND=auto          # resolves to "mlx" on Apple Silicon
AUTORESEARCH_TIME_BUDGET=300       # 5-minute wall-clock budget per experiment
AUTORESEARCH_MAX_ITERATIONS=100
AUTORESEARCH_METRIC=val_bpb
```

### Why TinyStories?

Karpathy's recommendation for resource-constrained hardware: lower entropy
means the model can learn meaningful patterns in less time and with a smaller
vocabulary, yielding cleaner val_bpb curves within the 5-minute budget.

---

## M3 Max Hardware Profile

| Spec | Value |
|------|-------|
| Chip | Apple M3 Max |
| CPU cores | 16 (12P + 4E) |
| GPU cores | 40 |
| Unified RAM | 36 GB |
| Memory bandwidth | 400 GB/s |
| MLX support | Yes (confirmed #903) |

MLX utilises the unified memory architecture — model weights, activations, and
training data all share the same physical pool, eliminating PCIe transfers.
This gives M3 Max a significant throughput advantage over external GPU setups
for models that fit in 36 GB.

---

## Community Reference Data

| Hardware | Experiments | Succeeded | Failed | Outcome |
|----------|-------------|-----------|--------|---------|
| Mac Mini M4 | 35 | 7 | 28 | Model improved by simplifying |
| Shopify (overnight) | ~50 | — | — | 19% quality gain; smaller beat 2× baseline |
| SkyPilot (16× GPU, 8 h) | ~910 | — | — | 2.87% improvement |
| Karpathy (H100, 2 days) | ~700 | 20+ | — | 11% training speedup |

**Mac Mini M4 failure rate: 80% (26/35).** Failures are expected and by design —
the 5-minute budget deliberately prunes slow experiments. The 20% success rate
still yielded an improved model.

---

## Baseline Results (M3 Max)

> Fill in after running: `timmy learn --target <module> --metric val_bpb --budget 5 --max-experiments 50`

| Run | Date | Experiments | Succeeded | val_bpb (start) | val_bpb (end) | Δ |
|-----|------|-------------|-----------|-----------------|---------------|---|
| 1 | — | — | — | — | — | — |

### Throughput estimate

Based on the M3 Max hardware profile and Mac Mini M4 community data, expected
throughput is **8–14 experiments/hour** with the 5-minute budget and TinyStories
dataset. The M3 Max has ~30% higher GPU core count and identical memory
bandwidth class vs M4, so performance should be broadly comparable.

---

## Apple Silicon Compatibility Notes

### MLX path (recommended)

- Install: `pip install mlx mlx-lm`
- `AUTORESEARCH_BACKEND=auto` resolves to `mlx` on arm64 macOS
- Pros: unified memory, no PCIe overhead, native Metal backend
- Cons: MLX op coverage is a subset of PyTorch; some custom CUDA kernels won't port

### llama.cpp path (fallback)

- Use when MLX op support is insufficient
- Set `AUTORESEARCH_BACKEND=cpu` to force CPU mode
- Slower throughput but broader op compatibility

### Known issues

- `subprocess.TimeoutExpired` is the normal termination path — autoresearch
  treats timeout as a completed-but-pruned experiment, not a failure
- Large batch sizes may trigger OOM if other processes hold unified memory;
  set `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` to disable the MPS high-watermark

---

## Next Steps (H2)

See #904 Horizon 2 for the meta-autoresearch plan: expand experiment units from
code changes → system configuration changes (prompts, tools, memory strategies).