Apple MLX has quietly become the fastest way to run, quantize, and fine-tune large language models on a Mac. It is Apple's open-source array framework for Apple Silicon, and its defining idea is unified memory: model weights, the KV cache, and activations all live in one physical pool shared by the CPU and GPU, so operations dispatch across devices with no copying. For developers, that erases the VRAM wall that constrains consumer NVIDIA GPUs.
MLX first shipped on December 5, 2023 from Apple Machine Learning Research. By June 2026 the project has crossed 27,300 GitHub stars, sits at stable version v0.31.2, and anchors a model hub of roughly 4,800 pre-converted MLX models on Hugging Face. The story this year is twofold: the M5 chip's GPU Neural Accelerators reset prompt-speed expectations, and WWDC 2026 opened Apple's Foundation Models framework to any MLX backend.
This guide is written for developers deciding whether MLX belongs in their stack. It covers the framework's mental model, the mlx-lm toolchain, the M5 performance numbers and what is independently confirmed versus vendor-stated, how fine-tuning memory math actually works, and the new agentic and Swift integration paths. Every figure here is labeled by source, because local-AI benchmarks move fast and cross-vendor claims deserve scrutiny.
- 01MLX is built around unified memory, not bolted onto it.Arrays live in shared CPU/GPU memory, so operations dispatch across devices with zero copies. Lazy evaluation lets the framework fuse ops before any compute is dispatched. The Python API closely follows NumPy, with Swift, C++, and C bindings that mirror it.
- 02mlx-lm is the whole toolchain in one pip install.Text generation, interactive chat, quantization, LoRA/QLoRA fine-tuning, and an OpenAI-compatible HTTP server all ship in mlx-lm. The default model downloads automatically; thousands of Hugging Face models work out of the box.
- 03M5 resets prompt-processing speed.Apple's own tests show the M5's GPU Neural Accelerators deliver 3.3x to 4.06x faster time-to-first-token than M4 across Qwen 1.7B–14B and GPT OSS 20B. Token generation, which is bandwidth-bound, improves a more modest 1.19x to 1.27x.
- 04No VRAM wall is the real Mac advantage.Because memory is unified, a 32GB Mac can fine-tune a 14B model that would out-of-memory a 24GB discrete GPU. The win is capacity, not raw training throughput — Apple Silicon bandwidth still trails a high-end NVIDIA card.
- 05WWDC 2026 made any MLX model a Swift-native backend.Apple opened the Foundation Models framework to custom backends and shipped MLXLanguageModel, which loads any mlx-community model into the same Swift API — streaming, tool calling, and structured output included.
01 — The FrameworkWhat MLX actually is.
MLX is an array framework — think NumPy or JAX, not a model zoo — designed specifically for Apple Silicon. It was released on December 5, 2023 by Apple Machine Learning Research, authored equally by Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. The team's stated intent is unusually clear: a framework built by machine-learning researchers for machine-learning researchers, user-friendly but still efficient to train and deploy models.
Two design choices define it. First, the Python API closely follows NumPy, making it immediately legible to any Python practitioner, with fully featured Swift, C++, and C APIs that mirror the same interface. Second, MLX uses lazy evaluation — arrays are only materialized when needed — which lets the framework fuse operations and cut memory-allocation overhead before any compute reaches the hardware.
The project is also genuinely active. MLX has shipped roughly 73 releases since launch, on a cadence of about one every three to four weeks, and reached stable version v0.31.2 on April 22, 2026. That velocity matters: features like Neural Accelerator support and distributed inference are recent additions, so pinning a known-good version is part of operating MLX responsibly.
GitHub stars
MLX has crossed 27,300 stars and 2,000 forks on ml-explore/mlx as of June 2026 — independently confirmed on the repository.
Releases since 2023
About one release every three to four weeks since the December 2023 launch. Current stable is v0.31.2, with mlx-lm at v0.31.3.
Pre-converted MLX models
The mlx-community Hugging Face org hosts approximately 4,800 quantized MLX models as reported at WWDC 2026 — LLMs, VLMs, audio, and image generation.
02 — Unified MemoryUnified memory and the missing VRAM wall.
The single most important thing to understand about MLX is its memory model. On a discrete GPU, model weights live in dedicated VRAM and every tensor that the CPU touches must be copied across the PCIe bus. MLX inherits Apple Silicon's unified memory instead: arrays live in one shared pool, and operations dispatch across the CPU and GPU without any data transfer. The KV cache, the model weights, and the activations all reside in the same physical memory.
This is a genuinely different model from PyTorch's MPS backend, which adapts a CUDA-centric design to Metal. In late-2025 research comparing runtimes on Apple Silicon, PyTorch MPS was documented with a roughly 4GB single-tensor cap that triggers out-of-memory errors beyond about 2,000 tokens, with no advanced caching strategies. MLX has no such cap — the entire unified pool is available — and the same study measured MLX delivering meaningfully faster steady-state generation on identical hardware.
The practical consequence is the headline of this whole section: a 32GB Mac can comfortably fine-tune a 14B model that would crash a 24GB consumer GPU on its dedicated VRAM. If you are weighing a Mac against a discrete-GPU box, our DGX Spark vs M5 Max vs RTX PRO 6000 comparison breaks down where capacity beats raw bandwidth.
| Capability | MLX | llama.cpp | PyTorch MPS |
|---|---|---|---|
| Memory and limits | |||
| Native memory model | Unified, zero-copy CPU/GPU | GGUF mmap, CPU + Metal offload | CUDA-style, adapted to Metal |
| Single-tensor cap | None (full unified pool) | None | ~4GB cap, OOM past ~2k tokens * |
| Features | |||
| Quantization depth | 3–8 bit + mxfp8 / nvfp4 | 2–8 bit GGUF | Limited |
| LoRA / QLoRA fine-tuning | Built in (mlx_lm.lora) | Inference-focused | Yes, but memory-capped |
| OpenAI-compatible server | mlx_lm.server | llama-server | None built in |
| Multi-Mac distributed | Thunderbolt RDMA (v0.30.1+) * | RPC backend | No |
| Developer fit | |||
| Primary interface | NumPy-like Python + Swift/C++/C | C/C++ with bindings | PyTorch Python |
| Best use case | Native Apple Silicon train + serve | Portable cross-platform inference | Porting existing PyTorch code |
* The PyTorch MPS 4GB single-tensor cap and the MLX Thunderbolt RDMA backend (v0.30.1+, macOS 26.2+) are documented in late-2025/2026 sources; verify both against current releases, since each is tied to a specific framework or OS version.
03 — Function TransformsNo backward(): transforms, not tensor state.
For PyTorch developers, MLX's biggest conceptual jump is how it handles gradients. There is no backward(), zero_grad(), or requires_grad. Instead, MLX follows the JAX paradigm: gradients are computed by transforming functions, not by mutating tensor state. You write a pure function, then wrap it.
MLX provides four core composable function transforms — grad(), vmap(), jvp(), and vjp() — plus compile(). Because they are composable, an expression like mx.grad(mx.vmap(mx.grad(fn))) is valid Python: a vmapped, second-order gradient in one line. The mx.compile() transform caches and optimizes the compute graph, so repeated calls inside a training loop skip graph re-analysis — analogous to torch.compile, but expressed as a first-class composable transform rather than a wrapper.
“If you are coming to MLX from PyTorch, you no longer need functions like backward, zero_grad, and detach, or properties like requires_grad.”— MLX documentation, Function Transforms guide
This is not just stylistic. Transforming functions rather than mutating tensors makes higher-order gradients and per-sample vectorization fall out naturally, and it removes a whole class of stateful bugs — forgotten zero_grad() calls, stray requires_grad flags — that PyTorch users learn to dread. If you already think in NumPy and have brushed against JAX, MLX will feel familiar within an afternoon.
04 — mlx-lmmlx-lm: run, quantize, and serve from one install.
Most developers will not touch raw MLX arrays day to day — they will live in mlx-lm, the official LLM layer built on top of MLX (currently v0.31.3, released alongside MLX on April 22, 2026). A single pip install mlx-lm gives you generation, chat, model conversion, fine-tuning, and a local server. Its default model for generation and chat is mlx-community/Llama-3.2-3B-Instruct-4bit, which downloads automatically, and thousands of Hugging Face models work out of the box.
For agentic and long-context work, two features matter most. mlx-lm ships prompt caching, which cuts subsequent-response latency, and rotating fixed-size KV caches for long-context prompts — essential when an agent session grows to hundreds of thousands of tokens. Quantization spans 3-bit through 8-bit plus the mxfp8 and nvfp4 mixed-precision formats, whose Metal kernel support landed in MLX v0.30.3 in January 2026.
mlx_lm.generate
Run any mlx-community model from the CLI or Python. The default Llama-3.2-3B-Instruct-4bit downloads on first call; swap in any Hugging Face MLX model with a flag.
mlx_lm.convert -q
Convert and quantize Hugging Face checkpoints to MLX format in one command. A 4-bit 7B model shrinks weights roughly 3.5x versus BF16, fitting in 8–9GB of unified memory.
mlx_lm.lora
QLoRA is automatic: point --model at a quantized model and training uses QLoRA with no extra flags. Defaults adapt 16 layers; --grad-checkpoint trades compute for memory.
mlx_lm.server
Exposes /v1/chat/completions at 127.0.0.1:8080 with tool calling and reasoning-model support. Any OpenAI-protocol agent framework points at it as a drop-in local API.
If you are choosing a quantization format for your converted models, the trade-offs between MLX, GGUF, AWQ, and GPTQ are not interchangeable — accuracy and speed differ by hardware target. Our quantization formats comparison covers which format fits a Mac versus a CUDA box.
05 — M5 Neural AcceleratorsM5 resets prompt-processing speed.
Apple's own MLX-on-M5 research draws a clean line between two phases of inference. Prompt processing — time-to-first-token — is compute-bound and runs on the M5's new GPU Neural Accelerators. Token generation afterward is memory-bandwidth-bound. The two scale very differently, and conflating them is the most common mistake in local-AI benchmarking.
For prompt processing, Apple measured the M5 hitting 3.33x to 4.06x faster TTFT than the M4 across Qwen 1.7B–14B and GPT OSS 20B — and a dense 14B model dropping under 10 seconds to first token on a 24GB M5 MacBook Pro. Token generation improves a more modest 1.19x to 1.27x, roughly proportional to the M5 base chip's 28% bandwidth bump (153 GB/s versus the M4's 120 GB/s). These are vendor-stated figures, so treat them as a directional ceiling, not a guarantee.
M5 vs M4 TTFT
Apple-tested on Qwen 14B 4-bit; the range across tested models is 3.33x–4.06x. Compute-bound and powered by the M5's GPU Neural Accelerators.
M5 vs M4 decode
Memory-bandwidth-bound, so the gain is smaller — 1.19x to 1.27x — and tracks the bandwidth increase rather than the compute one.
M5 base vs 120 on M4
A 28% increase on the base chip. Higher-tier parts scale far past this — a Max-class chip runs several times the base bandwidth, which is why generation speed varies so widely by configuration.
06 — BenchmarksMLX vs llama.cpp, read honestly.
The clearest signal that MLX has won the Apple Silicon performance argument came on March 30, 2026, when Ollama v0.19 switched its Apple Silicon backend from llama.cpp to MLX. On the same M5 hardware running Qwen3.5-35B-A3B in NVFP4, Ollama reported prefill rising from 1,154 to 1,810 tokens per second and decode from 58 to 112 tokens per second — a +57% and +93% jump respectively. These are vendor-reported numbers on Ollama's own test rig.
Ollama on M5 · llama.cpp backend → MLX backend
Source: Ollama engineering blog, Mar 2026 (vendor)Independent research backs the direction even if the exact figures vary by who is measuring. A 2026 comparative study found MLX achieving 21% to 87% higher throughput than llama.cpp across models from Qwen3-0.6B to Nemotron-30B, with a peak around 525 tok/s on text models on an M4 Max. That study comes from the vllm-mlx authors, so its more dramatic caching claims should not be generalized — but the broad finding, that MLX sustains higher generation throughput on Apple Silicon, shows up consistently. LM Studio reached the same conclusion back in October 2024, when its v0.3.4 release became the first major GUI to ship an MLX engine.
If you are weighing runners rather than frameworks, our guides to running local LLMs with Ollama, LM Studio, and vLLM and to LM Studio's MLX backend cover the GUI and server options that sit on top of MLX.
“Prefill jumping from 1,154 to 1,810 tokens per second, and decode from 58 to 112 tokens per second”— Ollama engineering blog, March 2026 (vendor-stated, M5)
07 — Fine-TuningFine-tuning on a Mac, by the memory math.
Fine-tuning is where the no-VRAM-wall advantage gets concrete. mlx_lm.lora supports LoRA, DoRA, and full fine-tuning, and QLoRA is automatic — if --model points at a quantized model, training uses QLoRA with no extra flags. Useful knobs include --grad-checkpoint for memory savings, --grad-accumulation-steps, and --num-layers (16 adapted by default). Apple's own LORA.md documents LoRA training on an M1 Max (32GB) at roughly 250 tokens per second on WikiSQL.
The right way to size a Mac is to start from the weight footprint. A model's BF16 weights are roughly its parameter count times two bytes; a 4-bit quantization is roughly the parameter count times half a byte. Real runtime memory then adds the KV cache, activations, and framework overhead — which is why the minimum-RAM column below sits well above the bare 4-bit weight size. The table recomputes both footprints from that formula.
| Model size | BF16 weights | 4-bit weights | Min unified RAM | Comfortable Mac |
|---|---|---|---|---|
| 3B | ~6 GB | ~1.5 GB | ~4–6 GB | 8 GB MacBook Air |
| 7–8B | ~14–16 GB | ~3.5–4 GB | ~8–9 GB | 16 GB Mac |
| 14B | ~28 GB | ~7 GB | ~14–18 GB | 32 GB Mac |
| 32B | ~64 GB | ~16 GB | ~20–25 GB | 48–64 GB Mac |
| 70B | ~140 GB | ~35 GB | ~42–48 GB | 64 GB+ Mac |
Weights computed as params x 2 bytes (BF16) and params x ~0.5 bytes (4-bit). The min-RAM column adds KV cache, activations, and overhead for QLoRA fine-tuning and grows with context length, so treat it as a starting point, not a guarantee. Community benchmarks put a Mistral-7B QLoRA run on 5,000 examples at roughly 90 minutes on an M2 Max with about 7GB peak memory — indicative, single-source, and dataset-dependent.
08 — WWDC 2026The 2026 unlock: agentic and Swift-native.
The most significant MLX news of 2026 is not a speed number — it is an integration. At WWDC 2026, Apple opened the Foundation Models framework to any LLM backend and shipped MLXLanguageModel, a conforming backend that loads any mlx-community Hugging Face model directly into the Foundation Models Swift API. Streaming, tool calling, structured output via @Generable, and multi-turn sessions work identically to Apple's built-in system model. For iOS and macOS developers, that turns roughly 4,800 community models into drop-in substitutes for the on-device default.
Two more capabilities round out the agentic story. The OpenAI- compatible mlx_lm.server means any agent framework speaking the chat-completions protocol — OpenCode, LangChain, the Claude agent SDK — can point at a local endpoint as a cloud-API replacement. And for models too large for a single machine, MLX v0.30.1 added RDMA over Thunderbolt through its JACCL backend (macOS 26.2+), letting a small cluster of Macs run distributed inference; a four-node Thunderbolt cluster can reach up to a 3x speedup and host models that exceed any single machine's memory.
MLXLanguageModel backend
WWDC 2026 opened Foundation Models to custom backends. Any mlx-community model now loads into the same Swift API as Apple's system model, with streaming, tools, and @Generable structured output.
OpenAI-compatible local server
mlx_lm.server exposes /v1/chat/completions with tool calling. Continuous batching keeps concurrent agent requests from stalling each other — a drop-in for cloud APIs in local agent loops.
Thunderbolt RDMA cluster
MLX v0.30.1+ on macOS 26.2+ adds RDMA over Thunderbolt. A four-node cluster reaches up to a 3x speedup and runs models larger than any single Mac's memory — still a leading-edge, version-gated feature.
For teams building on-device or hybrid agent stacks, this is the piece that changes the architecture. A local mlx_lm.server endpoint behind the same OpenAI protocol your cloud code already speaks means you can route privacy-sensitive or zero-marginal-cost work to a Mac and burst to the cloud only when you need frontier capability. Our local-AI versus cloud subscription cost analysis works through when that trade-off pays off.
09 — The DecisionA Mac or a GPU box?
MLX does not make Apple Silicon a universal winner — it makes it the right tool for a specific shape of problem. The honest framing is about memory versus bandwidth. A Mac gives you enormous unified capacity for the money and a clean train-plus-serve story; a discrete NVIDIA card gives you several times the memory bandwidth and the mature CUDA ecosystem. Apple Silicon training bandwidth still trails a high-end card meaningfully, so the Mac advantage is fitting big models in memory, not out-running a GPU on raw throughput.
Run a 14B–32B model that OOMs a consumer GPU
Unified memory lets a 32GB or 64GB Mac fine-tune models that a 24GB card cannot hold. If your bottleneck is capacity, not training speed, MLX on a well-specced Mac is the cheapest viable path.
Ship local AI inside an iOS or macOS app
MLXLanguageModel makes any mlx-community model a Foundation Models backend. For app developers who want private, offline inference with native streaming and tool calling, nothing else is this integrated.
Train fast on large datasets
When raw training speed dominates and the model fits in VRAM, a discrete NVIDIA card with its higher bandwidth and CUDA ecosystem still wins. MLX closes the memory gap, not the bandwidth gap.
Serve concurrent agents at scale
A single Mac with mlx_lm.server handles local agent loops and privacy-bound work well. For high-concurrency production serving, pair it with cloud burst capacity rather than scaling Macs alone.
Looking forward, the trajectory favors MLX for an expanding slice of developer work. Each release narrows the software gap with CUDA, Thunderbolt clustering chips away at the single-machine memory ceiling, and the Foundation Models integration gives MLX a distribution channel no competing framework has — every Mac and iPhone app. The constraint that will not move quickly is bandwidth: until Apple closes that gap, the durable MLX thesis is capacity and integration, not peak speed. For teams weighing local versus cloud inference as part of a broader build, our AI transformation engagements start with exactly this kind of hardware-and-cost eval.
10 — ConclusionThe most integrated local-AI stack on any laptop.
MLX turned the Mac into a credible local-AI workstation — by capacity, not by brute force.
Two and a half years after launch, MLX has matured into the default way to do serious local AI on Apple Silicon. The unified-memory model removes the VRAM wall, mlx-lm collapses run, quantize, fine-tune, and serve into one toolchain, and the M5's Neural Accelerators reset prompt-processing speed. The pieces that were missing — a Swift-native backend, an OpenAI-compatible server, distributed inference — all landed in 2026.
The honest caveat is the one worth repeating: Apple Silicon wins on memory capacity and integration, not on raw bandwidth. A high-end discrete GPU is still faster per dollar of throughput when the model fits in VRAM. But for the developer who wants to fine-tune a mid-size model on a laptop, ship private inference inside an app, or run an agent loop with zero marginal cost, MLX is no longer the experimental option — it is the obvious one.
The practical move is to benchmark on your own models and prompts rather than trust any single headline figure, vendor or community. Pin a known-good MLX version, measure TTFT and decode separately, and size your Mac from the weight-footprint math rather than a rule of thumb. Do that, and a Mac you may already own becomes a capable, private, recurring-cost-free AI workstation.