[SOTA] Deploy MLX on Mac M3 Max for faster local inference #416

Closed
opened 2026-04-08 11:20:55 +00:00 by Timmy · 0 comments
Owner

From SOTA research Q2 2026.

MLX (25K★) — Apple's native ML framework for M-series chips. Direct Metal acceleration. Can be faster than llama.cpp for some models.

Our Mac M3 Max (36GB) is the most powerful machine in the fleet. MLX could unlock faster inference than llama-server for Timmy's own sessions.

Acceptance Criteria

  - [ ] Install mlx and mlx-lm
  - [ ] Benchmark: MLX vs llama.cpp on hermes3:8b (tok/s comparison)
  - [ ] If faster: set up MLX serving on a port alongside llama-server
  - [ ] NOT a replacement for llama.cpp on VPSes (MLX is Apple-only)
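The criteria above could be worked through roughly as follows. This is a sketch, not a tested procedure: the MLX checkpoint name and the port number are assumptions (hermes3:8b is an Ollama tag; the MLX side needs a converted Hugging Face checkpoint, so verify the repo name before running), and everything here requires Apple Silicon.

```shell
# Sketch only. Assumed names: mlx-community/Hermes-3-Llama-3.1-8B-4bit
# (verify on Hugging Face) and port 8081 for the MLX server.

# 1. Install MLX and the LM tooling (Apple Silicon only).
pip install mlx mlx-lm

# 2. MLX side of the benchmark: mlx_lm.generate reports generation
#    speed in tokens-per-second after it finishes.
mlx_lm.generate \
  --model mlx-community/Hermes-3-Llama-3.1-8B-4bit \
  --prompt "Explain the Metal API in one paragraph." \
  --max-tokens 256

# 3. llama.cpp side: llama-bench reports prompt and generation tok/s
#    for the same token budget (path to the GGUF is a placeholder).
llama-bench -m /path/to/hermes3-8b.gguf -n 256

# 4. If MLX wins: serve it on its own port alongside llama-server.
#    mlx_lm.server exposes an OpenAI-compatible HTTP API.
mlx_lm.server \
  --model mlx-community/Hermes-3-Llama-3.1-8B-4bit \
  --port 8081
```

For a fair comparison, both runs should use the same quantization level and the same generation length; tok/s differs substantially between 4-bit and 8-bit variants of the same model.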
Timmy self-assigned this 2026-04-08 11:20:56 +00:00

Reference: Timmy_Foundation/timmy-config#416