DevelopmentIndustry Guide12 min readPublished June 29, 2026

Zero-copy unified memory · mlx-lm run + fine-tune · 4x faster prompt processing on M5

Apple MLX in 2026: Zero-Copy Local AI on Apple Silicon

MLX is Apple's purpose-built array framework for Apple Silicon, where unified memory makes CPU and GPU operations zero-copy by design. This guide walks developers through the framework's core ideas, the mlx-lm toolchain for running and fine-tuning models, the M5 Neural Accelerator gains, and the WWDC 2026 unlock that drops any mlx-community model into Apple's Foundation Models API.

DA
Digital Applied Team
Senior engineers · Published Jun 29, 2026
PublishedJun 29, 2026
Read time12 min
SourcesApple ML Research + GitHub
Prompt processing M5 vs M4
4.06x
TTFT, Qwen 14B 4-bit
Apple-tested
Throughput vs llama.cpp
87%
peak gain, research study
Pre-converted MLX models
~4,800
on mlx-community (HF)
GitHub stars
27.3K
ml-explore/mlx

Apple MLX has quietly become the fastest way to run, quantize, and fine-tune large language models on a Mac. It is Apple's open-source array framework for Apple Silicon, and its defining idea is unified memory: model weights, the KV cache, and activations all live in one physical pool shared by the CPU and GPU, so operations dispatch across devices with no copying. For developers, that erases the VRAM wall that constrains consumer NVIDIA GPUs.

MLX first shipped on December 5, 2023 from Apple Machine Learning Research. By June 2026 the project has crossed 27,300 GitHub stars, sits at stable version v0.31.2, and anchors a model hub of roughly 4,800 pre-converted MLX models on Hugging Face. The story this year is twofold: the M5 chip's GPU Neural Accelerators reset prompt-speed expectations, and WWDC 2026 opened Apple's Foundation Models framework to any MLX backend.

This guide is written for developers deciding whether MLX belongs in their stack. It covers the framework's mental model, the mlx-lm toolchain, the M5 performance numbers and what is independently confirmed versus vendor-stated, how fine-tuning memory math actually works, and the new agentic and Swift integration paths. Every figure here is labeled by source, because local-AI benchmarks move fast and cross-vendor claims deserve scrutiny.

Key takeaways
  1. 01
    MLX is built around unified memory, not bolted onto it.Arrays live in shared CPU/GPU memory, so operations dispatch across devices with zero copies. Lazy evaluation lets the framework fuse ops before any compute is dispatched. The Python API closely follows NumPy, with Swift, C++, and C bindings that mirror it.
  2. 02
    mlx-lm is the whole toolchain in one pip install.Text generation, interactive chat, quantization, LoRA/QLoRA fine-tuning, and an OpenAI-compatible HTTP server all ship in mlx-lm. The default model downloads automatically; thousands of Hugging Face models work out of the box.
  3. 03
    M5 resets prompt-processing speed.Apple's own tests show the M5's GPU Neural Accelerators deliver 3.3x to 4.06x faster time-to-first-token than M4 across Qwen 1.7B–14B and GPT OSS 20B. Token generation, which is bandwidth-bound, improves a more modest 1.19x to 1.27x.
  4. 04
    No VRAM wall is the real Mac advantage.Because memory is unified, a 32GB Mac can fine-tune a 14B model that would out-of-memory a 24GB discrete GPU. The win is capacity, not raw training throughput — Apple Silicon bandwidth still trails a high-end NVIDIA card.
  5. 05
    WWDC 2026 made any MLX model a Swift-native backend.Apple opened the Foundation Models framework to custom backends and shipped MLXLanguageModel, which loads any mlx-community model into the same Swift API — streaming, tool calling, and structured output included.

01The FrameworkWhat MLX actually is.

MLX is an array framework — think NumPy or JAX, not a model zoo — designed specifically for Apple Silicon. It was released on December 5, 2023 by Apple Machine Learning Research, authored equally by Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. The team's stated intent is unusually clear: a framework built by machine-learning researchers for machine-learning researchers, user-friendly but still efficient to train and deploy models.

Two design choices define it. First, the Python API closely follows NumPy, making it immediately legible to any Python practitioner, with fully featured Swift, C++, and C APIs that mirror the same interface. Second, MLX uses lazy evaluation — arrays are only materialized when needed — which lets the framework fuse operations and cut memory-allocation overhead before any compute reaches the hardware.

The project is also genuinely active. MLX has shipped roughly 73 releases since launch, on a cadence of about one every three to four weeks, and reached stable version v0.31.2 on April 22, 2026. That velocity matters: features like Neural Accelerator support and distributed inference are recent additions, so pinning a known-good version is part of operating MLX responsibly.

Community
GitHub stars
27.3K

MLX has crossed 27,300 stars and 2,000 forks on ml-explore/mlx as of June 2026 — independently confirmed on the repository.

ml-explore/mlx
Cadence
Releases since 2023
73

About one release every three to four weeks since the December 2023 launch. Current stable is v0.31.2, with mlx-lm at v0.31.3.

v0.31.2 · Apr 22, 2026
Model hub
Pre-converted MLX models
~4,800

The mlx-community Hugging Face org hosts approximately 4,800 quantized MLX models as reported at WWDC 2026 — LLMs, VLMs, audio, and image generation.

huggingface.co/mlx-community

02Unified MemoryUnified memory and the missing VRAM wall.

The single most important thing to understand about MLX is its memory model. On a discrete GPU, model weights live in dedicated VRAM and every tensor that the CPU touches must be copied across the PCIe bus. MLX inherits Apple Silicon's unified memory instead: arrays live in one shared pool, and operations dispatch across the CPU and GPU without any data transfer. The KV cache, the model weights, and the activations all reside in the same physical memory.

This is a genuinely different model from PyTorch's MPS backend, which adapts a CUDA-centric design to Metal. In late-2025 research comparing runtimes on Apple Silicon, PyTorch MPS was documented with a roughly 4GB single-tensor cap that triggers out-of-memory errors beyond about 2,000 tokens, with no advanced caching strategies. MLX has no such cap — the entire unified pool is available — and the same study measured MLX delivering meaningfully faster steady-state generation on identical hardware.

The practical consequence is the headline of this whole section: a 32GB Mac can comfortably fine-tune a 14B model that would crash a 24GB consumer GPU on its dedicated VRAM. If you are weighing a Mac against a discrete-GPU box, our DGX Spark vs M5 Max vs RTX PRO 6000 comparison breaks down where capacity beats raw bandwidth.

Capability and performance comparison of MLX, llama.cpp, and PyTorch MPS for local LLM work on Apple Silicon, grouped by memory model, features, and developer fit.
CapabilityMLXllama.cppPyTorch MPS
Memory and limits
Native memory modelUnified, zero-copy CPU/GPUGGUF mmap, CPU + Metal offloadCUDA-style, adapted to Metal
Single-tensor capNone (full unified pool)None~4GB cap, OOM past ~2k tokens *
Features
Quantization depth3–8 bit + mxfp8 / nvfp42–8 bit GGUFLimited
LoRA / QLoRA fine-tuningBuilt in (mlx_lm.lora)Inference-focusedYes, but memory-capped
OpenAI-compatible servermlx_lm.serverllama-serverNone built in
Multi-Mac distributedThunderbolt RDMA (v0.30.1+) *RPC backendNo
Developer fit
Primary interfaceNumPy-like Python + Swift/C++/CC/C++ with bindingsPyTorch Python
Best use caseNative Apple Silicon train + servePortable cross-platform inferencePorting existing PyTorch code

* The PyTorch MPS 4GB single-tensor cap and the MLX Thunderbolt RDMA backend (v0.30.1+, macOS 26.2+) are documented in late-2025/2026 sources; verify both against current releases, since each is tied to a specific framework or OS version.

The mental shift
A discrete GPU forces you to think in two memory spaces and copy between them. MLX collapses that into one pool. The ceiling on what you can run or fine-tune stops being dedicated VRAM and becomes the total RAM you bought — which is why a well-specced Mac punches far above its bandwidth class for memory-bound LLM work.

03Function TransformsNo backward(): transforms, not tensor state.

For PyTorch developers, MLX's biggest conceptual jump is how it handles gradients. There is no backward(), zero_grad(), or requires_grad. Instead, MLX follows the JAX paradigm: gradients are computed by transforming functions, not by mutating tensor state. You write a pure function, then wrap it.

MLX provides four core composable function transforms — grad(), vmap(), jvp(), and vjp() — plus compile(). Because they are composable, an expression like mx.grad(mx.vmap(mx.grad(fn))) is valid Python: a vmapped, second-order gradient in one line. The mx.compile() transform caches and optimizes the compute graph, so repeated calls inside a training loop skip graph re-analysis — analogous to torch.compile, but expressed as a first-class composable transform rather than a wrapper.

“If you are coming to MLX from PyTorch, you no longer need functions like backward, zero_grad, and detach, or properties like requires_grad.”— MLX documentation, Function Transforms guide

This is not just stylistic. Transforming functions rather than mutating tensors makes higher-order gradients and per-sample vectorization fall out naturally, and it removes a whole class of stateful bugs — forgotten zero_grad() calls, stray requires_grad flags — that PyTorch users learn to dread. If you already think in NumPy and have brushed against JAX, MLX will feel familiar within an afternoon.

04mlx-lmmlx-lm: run, quantize, and serve from one install.

Most developers will not touch raw MLX arrays day to day — they will live in mlx-lm, the official LLM layer built on top of MLX (currently v0.31.3, released alongside MLX on April 22, 2026). A single pip install mlx-lm gives you generation, chat, model conversion, fine-tuning, and a local server. Its default model for generation and chat is mlx-community/Llama-3.2-3B-Instruct-4bit, which downloads automatically, and thousands of Hugging Face models work out of the box.

For agentic and long-context work, two features matter most. mlx-lm ships prompt caching, which cuts subsequent-response latency, and rotating fixed-size KV caches for long-context prompts — essential when an agent session grows to hundreds of thousands of tokens. Quantization spans 3-bit through 8-bit plus the mxfp8 and nvfp4 mixed-precision formats, whose Metal kernel support landed in MLX v0.30.3 in January 2026.

Generate
mlx_lm.generate
one-shot text generation

Run any mlx-community model from the CLI or Python. The default Llama-3.2-3B-Instruct-4bit downloads on first call; swap in any Hugging Face MLX model with a flag.

mlx_lm.chat for interactive
Convert
mlx_lm.convert -q
quantize to 3–8 bit

Convert and quantize Hugging Face checkpoints to MLX format in one command. A 4-bit 7B model shrinks weights roughly 3.5x versus BF16, fitting in 8–9GB of unified memory.

3–8 bit + mxfp8 / nvfp4
Fine-tune
mlx_lm.lora
LoRA · DoRA · full fine-tune

QLoRA is automatic: point --model at a quantized model and training uses QLoRA with no extra flags. Defaults adapt 16 layers; --grad-checkpoint trades compute for memory.

QLoRA auto-detected
Serve
mlx_lm.server
OpenAI-compatible HTTP

Exposes /v1/chat/completions at 127.0.0.1:8080 with tool calling and reasoning-model support. Any OpenAI-protocol agent framework points at it as a drop-in local API.

drop-in cloud replacement

If you are choosing a quantization format for your converted models, the trade-offs between MLX, GGUF, AWQ, and GPTQ are not interchangeable — accuracy and speed differ by hardware target. Our quantization formats comparison covers which format fits a Mac versus a CUDA box.

05M5 Neural AcceleratorsM5 resets prompt-processing speed.

Apple's own MLX-on-M5 research draws a clean line between two phases of inference. Prompt processing — time-to-first-token — is compute-bound and runs on the M5's new GPU Neural Accelerators. Token generation afterward is memory-bandwidth-bound. The two scale very differently, and conflating them is the most common mistake in local-AI benchmarking.

For prompt processing, Apple measured the M5 hitting 3.33x to 4.06x faster TTFT than the M4 across Qwen 1.7B–14B and GPT OSS 20B — and a dense 14B model dropping under 10 seconds to first token on a 24GB M5 MacBook Pro. Token generation improves a more modest 1.19x to 1.27x, roughly proportional to the M5 base chip's 28% bandwidth bump (153 GB/s versus the M4's 120 GB/s). These are vendor-stated figures, so treat them as a directional ceiling, not a guarantee.

Prompt processing
M5 vs M4 TTFT
4.06x

Apple-tested on Qwen 14B 4-bit; the range across tested models is 3.33x–4.06x. Compute-bound and powered by the M5's GPU Neural Accelerators.

Apple ML Research
Token generation
M5 vs M4 decode
1.27x

Memory-bandwidth-bound, so the gain is smaller — 1.19x to 1.27x — and tracks the bandwidth increase rather than the compute one.

Bandwidth-bound
Memory bandwidth
M5 base vs 120 on M4
153GB/s

A 28% increase on the base chip. Higher-tier parts scale far past this — a Max-class chip runs several times the base bandwidth, which is why generation speed varies so widely by configuration.

+28% vs M4 base
Do not conflate the chips
The 153 GB/s figure is the base M5, tested by Apple on a 24GB MacBook Pro. A Max-class chip runs in the hundreds of GB/s — community benchmarks put the M5 Max near 600 GB/s versus 546 GB/s on the M4 Max — roughly four times the base part. Base-chip and Max-chip numbers are not interchangeable, and community tok/s figures are standardized test runs, not Apple-official numbers.

06BenchmarksMLX vs llama.cpp, read honestly.

The clearest signal that MLX has won the Apple Silicon performance argument came on March 30, 2026, when Ollama v0.19 switched its Apple Silicon backend from llama.cpp to MLX. On the same M5 hardware running Qwen3.5-35B-A3B in NVFP4, Ollama reported prefill rising from 1,154 to 1,810 tokens per second and decode from 58 to 112 tokens per second — a +57% and +93% jump respectively. These are vendor-reported numbers on Ollama's own test rig.

Ollama on M5 · llama.cpp backend → MLX backend

Source: Ollama engineering blog, Mar 2026 (vendor)
Prefill throughputQwen3.5-35B-A3B NVFP4 · M5 · tokens/sec
1,810
+57%
Decode throughputQwen3.5-35B-A3B NVFP4 · M5 · tokens/sec
112
+93%
MLX backend (Ollama v0.19)llama.cpp backend (v0.18)

Independent research backs the direction even if the exact figures vary by who is measuring. A 2026 comparative study found MLX achieving 21% to 87% higher throughput than llama.cpp across models from Qwen3-0.6B to Nemotron-30B, with a peak around 525 tok/s on text models on an M4 Max. That study comes from the vllm-mlx authors, so its more dramatic caching claims should not be generalized — but the broad finding, that MLX sustains higher generation throughput on Apple Silicon, shows up consistently. LM Studio reached the same conclusion back in October 2024, when its v0.3.4 release became the first major GUI to ship an MLX engine.

If you are weighing runners rather than frameworks, our guides to running local LLMs with Ollama, LM Studio, and vLLM and to LM Studio's MLX backend cover the GUI and server options that sit on top of MLX.

“Prefill jumping from 1,154 to 1,810 tokens per second, and decode from 58 to 112 tokens per second”— Ollama engineering blog, March 2026 (vendor-stated, M5)

07Fine-TuningFine-tuning on a Mac, by the memory math.

Fine-tuning is where the no-VRAM-wall advantage gets concrete. mlx_lm.lora supports LoRA, DoRA, and full fine-tuning, and QLoRA is automatic — if --model points at a quantized model, training uses QLoRA with no extra flags. Useful knobs include --grad-checkpoint for memory savings, --grad-accumulation-steps, and --num-layers (16 adapted by default). Apple's own LORA.md documents LoRA training on an M1 Max (32GB) at roughly 250 tokens per second on WikiSQL.

The right way to size a Mac is to start from the weight footprint. A model's BF16 weights are roughly its parameter count times two bytes; a 4-bit quantization is roughly the parameter count times half a byte. Real runtime memory then adds the KV cache, activations, and framework overhead — which is why the minimum-RAM column below sits well above the bare 4-bit weight size. The table recomputes both footprints from that formula.

Weight footprint and recommended Mac memory by model size for MLX, with BF16 weights computed as parameters times two bytes and 4-bit weights as parameters times half a byte.
Model sizeBF16 weights4-bit weightsMin unified RAMComfortable Mac
3B~6 GB~1.5 GB~4–6 GB8 GB MacBook Air
7–8B~14–16 GB~3.5–4 GB~8–9 GB16 GB Mac
14B~28 GB~7 GB~14–18 GB32 GB Mac
32B~64 GB~16 GB~20–25 GB48–64 GB Mac
70B~140 GB~35 GB~42–48 GB64 GB+ Mac

Weights computed as params x 2 bytes (BF16) and params x ~0.5 bytes (4-bit). The min-RAM column adds KV cache, activations, and overhead for QLoRA fine-tuning and grows with context length, so treat it as a starting point, not a guarantee. Community benchmarks put a Mistral-7B QLoRA run on 5,000 examples at roughly 90 minutes on an M2 Max with about 7GB peak memory — indicative, single-source, and dataset-dependent.

08WWDC 2026The 2026 unlock: agentic and Swift-native.

The most significant MLX news of 2026 is not a speed number — it is an integration. At WWDC 2026, Apple opened the Foundation Models framework to any LLM backend and shipped MLXLanguageModel, a conforming backend that loads any mlx-community Hugging Face model directly into the Foundation Models Swift API. Streaming, tool calling, structured output via @Generable, and multi-turn sessions work identically to Apple's built-in system model. For iOS and macOS developers, that turns roughly 4,800 community models into drop-in substitutes for the on-device default.

Two more capabilities round out the agentic story. The OpenAI- compatible mlx_lm.server means any agent framework speaking the chat-completions protocol — OpenCode, LangChain, the Claude agent SDK — can point at a local endpoint as a cloud-API replacement. And for models too large for a single machine, MLX v0.30.1 added RDMA over Thunderbolt through its JACCL backend (macOS 26.2+), letting a small cluster of Macs run distributed inference; a four-node Thunderbolt cluster can reach up to a 3x speedup and host models that exceed any single machine's memory.

Swift-native
MLXLanguageModel backend
1

WWDC 2026 opened Foundation Models to custom backends. Any mlx-community model now loads into the same Swift API as Apple's system model, with streaming, tools, and @Generable structured output.

iOS + macOS apps
Agentic
OpenAI-compatible local server
8080

mlx_lm.server exposes /v1/chat/completions with tool calling. Continuous batching keeps concurrent agent requests from stalling each other — a drop-in for cloud APIs in local agent loops.

OpenCode · LangChain · agent SDKs
Distributed
Thunderbolt RDMA cluster
3x

MLX v0.30.1+ on macOS 26.2+ adds RDMA over Thunderbolt. A four-node cluster reaches up to a 3x speedup and runs models larger than any single Mac's memory — still a leading-edge, version-gated feature.

JACCL backend

For teams building on-device or hybrid agent stacks, this is the piece that changes the architecture. A local mlx_lm.server endpoint behind the same OpenAI protocol your cloud code already speaks means you can route privacy-sensitive or zero-marginal-cost work to a Mac and burst to the cloud only when you need frontier capability. Our local-AI versus cloud subscription cost analysis works through when that trade-off pays off.

09The DecisionA Mac or a GPU box?

MLX does not make Apple Silicon a universal winner — it makes it the right tool for a specific shape of problem. The honest framing is about memory versus bandwidth. A Mac gives you enormous unified capacity for the money and a clean train-plus-serve story; a discrete NVIDIA card gives you several times the memory bandwidth and the mature CUDA ecosystem. Apple Silicon training bandwidth still trails a high-end card meaningfully, so the Mac advantage is fitting big models in memory, not out-running a GPU on raw throughput.

Fine-tune mid-size models
Run a 14B–32B model that OOMs a consumer GPU

Unified memory lets a 32GB or 64GB Mac fine-tune models that a 24GB card cannot hold. If your bottleneck is capacity, not training speed, MLX on a well-specced Mac is the cheapest viable path.

Pick a Mac + MLX
On-device & Swift apps
Ship local AI inside an iOS or macOS app

MLXLanguageModel makes any mlx-community model a Foundation Models backend. For app developers who want private, offline inference with native streaming and tool calling, nothing else is this integrated.

Pick MLX + Foundation Models
Maximum training throughput
Train fast on large datasets

When raw training speed dominates and the model fits in VRAM, a discrete NVIDIA card with its higher bandwidth and CUDA ecosystem still wins. MLX closes the memory gap, not the bandwidth gap.

Pick a discrete GPU
Production agent serving
Serve concurrent agents at scale

A single Mac with mlx_lm.server handles local agent loops and privacy-bound work well. For high-concurrency production serving, pair it with cloud burst capacity rather than scaling Macs alone.

Hybrid: Mac + cloud burst

Looking forward, the trajectory favors MLX for an expanding slice of developer work. Each release narrows the software gap with CUDA, Thunderbolt clustering chips away at the single-machine memory ceiling, and the Foundation Models integration gives MLX a distribution channel no competing framework has — every Mac and iPhone app. The constraint that will not move quickly is bandwidth: until Apple closes that gap, the durable MLX thesis is capacity and integration, not peak speed. For teams weighing local versus cloud inference as part of a broader build, our AI transformation engagements start with exactly this kind of hardware-and-cost eval.

10ConclusionThe most integrated local-AI stack on any laptop.

Where MLX stands, June 2026

MLX turned the Mac into a credible local-AI workstation — by capacity, not by brute force.

Two and a half years after launch, MLX has matured into the default way to do serious local AI on Apple Silicon. The unified-memory model removes the VRAM wall, mlx-lm collapses run, quantize, fine-tune, and serve into one toolchain, and the M5's Neural Accelerators reset prompt-processing speed. The pieces that were missing — a Swift-native backend, an OpenAI-compatible server, distributed inference — all landed in 2026.

The honest caveat is the one worth repeating: Apple Silicon wins on memory capacity and integration, not on raw bandwidth. A high-end discrete GPU is still faster per dollar of throughput when the model fits in VRAM. But for the developer who wants to fine-tune a mid-size model on a laptop, ship private inference inside an app, or run an agent loop with zero marginal cost, MLX is no longer the experimental option — it is the obvious one.

The practical move is to benchmark on your own models and prompts rather than trust any single headline figure, vendor or community. Pin a known-good MLX version, measure TTFT and decode separately, and size your Mac from the weight-footprint math rather than a rule of thumb. Do that, and a Mac you may already own becomes a capable, private, recurring-cost-free AI workstation.

Put local AI into production

Unified memory makes a Mac a serious AI workstation — if you size it right.

We help engineering teams evaluate, benchmark, and deploy local and hybrid AI stacks — MLX on Apple Silicon, discrete-GPU boxes, and cloud-burst routing — with honest cost and capability math, delivered in days not quarters.

Free consultationExpert guidanceTailored solutions
What we work on

Local & hybrid AI engagements

  • MLX benchmarking on your own models and prompts
  • Mac vs discrete-GPU hardware sizing
  • On-device inference for iOS / macOS apps
  • Local agent loops via OpenAI-compatible servers
  • Cost & routing programs for local + cloud mix
FAQ · Apple MLX guide

The questions developers ask about MLX.

MLX is an open-source array framework — comparable to NumPy or JAX — built specifically for Apple Silicon by Apple Machine Learning Research. It was first released on December 5, 2023 and reached stable version v0.31.2 on April 22, 2026, with roughly 27,300 GitHub stars by mid-2026. Its defining feature is a unified-memory model: arrays live in one pool shared by the CPU and GPU, so operations dispatch across devices with no data copying. The Python API closely follows NumPy, and there are fully featured Swift, C++, and C APIs that mirror it. MLX is the foundation; mlx-lm is the higher-level library most developers use for running and fine-tuning language models on top of it.
Related dispatches

Continue exploring local AI on Apple Silicon.