Run a local LLM in 2026 and the first real decision is not which model — it is which runtime. Ollama, LM Studio, llama.cpp, vLLM, and MLX are the five engines worth knowing, and they are not interchangeable. One is a one-command desktop tool; another saturates eight NVIDIA GPUs under production load; a third only exists on Apple Silicon. Pick wrong and you either fight an over-engineered server on a laptop or starve a multi-user app of throughput.
The headline numbers do not help. A widely-cited Red Hat benchmark clocked vLLM at 793 tokens per second versus Ollama’s 41 on the same A100 — a 19x gap that gets quoted endlessly. What rarely travels with it is the caveat: at a single user, the two finish in a near dead heat. The gap is real, but it lives almost entirely under concurrency. Most guides lead with the 19x and bury the context.
This guide compares all five runtimes the way you actually choose between them — by chip, by concurrency, and by how much you want to operate. You get a five-runtime decision matrix, the Apple-Silicon format question (GGUF versus MLX) that most comparisons skip, a VRAM cheat sheet to map your hardware to a model size, and a clear read on when vLLM’s production muscle is worth the Docker. Every figure is traced to its source, and the vendor-stated ones are flagged as such.
- 01Five runtimes, five distinct jobs.Ollama is the one-command solo-dev default; LM Studio is a GUI plus a headless daemon; llama.cpp is the universal engine and edge option; vLLM is the production multi-user server; mlx-lm is the Apple Silicon fine-tuning and throughput path.
- 02The 19x throughput gap is a concurrency story.Red Hat measured vLLM at 793 tok/s versus Ollama at 41 at peak load, but the two are roughly tied (~130-180 tok/s) at a single user. vLLM's advantage is serving many requests at once, not raw single-stream speed.
- 03On a Mac, Ollama is no longer just a llama.cpp wrapper.Since early 2026 Ollama runs an MLX backend on Apple Silicon instead of the old llama.cpp Metal path. Ollama's own benchmark reports decode throughput roughly doubling — vendor-stated, but directionally real.
- 04On Apple Silicon, the model format matters too.MLX models run a touch smaller on disk and modestly faster than GGUF on M-series chips, per a single third-party benchmark. You choose a runtime and a format together, not separately.
- 05VRAM is the hard constraint, not speed.A 70B model at Q4_K_M needs roughly 38-48 GB of weights before any context, and its KV cache alone can grow from ~1.6 GB at 2K tokens to 42+ GB at 128K. Size the model to the silicon you have.
01 — The 19x GapThe 19x gap is a concurrency story, not a speed story.
The most-quoted number in the local-LLM world comes from a Red Hat Developer benchmark that pitted vLLM against Ollama on a single A100-PCIE-40GB running Llama 3.1-8B-Instruct, measured with GuideLLM over 300-second runs. At peak load, vLLM hit 793 tokens per second against Ollama’s 41 — the 19x gap everyone repeats.
The number that almost never travels with it: at a single user, both tools land in the same 130-180 tok/s band. The chart below shows what actually happens as you add concurrency. At eight simultaneous users, vLLM serves 187 tok/s to Ollama’s 82 (tuned), and tail latency tells the same story — vLLM’s P99 stays near 80 ms at peak while Ollama’s climbs to 673 ms. The mechanism is structural: vLLM’s PagedAttention and continuous batching pack many in-flight requests onto the GPU, whereas Ollama processes a near-serial queue.
Throughput by load · vLLM vs Ollama (tokens/sec)
Source: Red Hat Developer benchmark, Aug 2025 — GuideLLM on A100-PCIE-40GB, Llama 3.1-8BThe 19x gap is real — but it lives entirely at peak concurrency. At one user, Ollama and vLLM finish in a near dead heat.— Digital Applied, editorial synthesis of the Red Hat benchmark
02 — The FieldFive runtimes, each built for a different job.
Before the matrix, the cast. These five tools dominate local inference in 2026, and the fastest way to choose badly is to treat them as five flavors of the same thing. They are not — one is even the literal engine inside two of the others.
Ollama
One command to pull and run a model, with an OpenAI-compatible API at localhost:11434/v1. On Apple Silicon it now runs an MLX backend, not just llama.cpp. The right first stop for solo devs and prototyping on any OS.
LM Studio
A polished GUI that runs GGUF and MLX models side by side, plus llmster — a headless daemon for CI/CD and small-team servers, driven by the lms CLI. Free for personal and commercial use since July 2025.
llama.cpp
The inference engine underneath Ollama and LM Studio's GGUF path. No runtime dependencies, the widest hardware backend list in the field (Metal, CUDA, ROCm, Vulkan, SYCL, CPU), and quantization from 1.5-bit to 8-bit.
vLLM
Built for high-concurrency serving on NVIDIA (with AMD and TPU support). PagedAttention plus continuous batching produce the peak-throughput numbers. Speaks OpenAI, Anthropic Messages, and gRPC. Originated at UC Berkeley's Sky Computing Lab.
MLX / mlx-lm
Apple's own array framework for M-series chips. No built-in server, but the fastest path to fine-tuning and maximum throughput on a Mac, with thousands of pre-quantized models on the mlx-community hub. Does not run on Intel Macs.
One practical thread ties four of the five together: an OpenAI-compatible endpoint. Ollama, LM Studio, llama.cpp’s llama-server, and vLLM all expose OpenAI-style routes, so most client code that targets the OpenAI SDK can repoint at localhost with a one-line base-URL change. The exception is mlx-lm, which ships no server of its own and is usually wrapped by Ollama or LM Studio. Watch one edge: Ollama’s compatibility layer omits a few fields the real API supports — logprobs, tool-choice, and logit-bias among them — so apps that lean on those features need to account for the gap. If you are still deciding whether to self-host open-weight models at all, settle that first; this post assumes you have.
LM Studio also pairs with LM Link to drive models from a phone — we cover that separately in our guide to running your largest local models from your iPhone via LM Link. It is a genuinely different use case from the runtime-selection question here, so we will not rehash it.
03 — Decision MatrixOne matrix, five runtimes, the dimensions that decide it.
Most comparisons cover two or three of these tools and skip llama.cpp-as-a-standalone and mlx-lm entirely. Here is the full field across the axes that actually drive the choice: how hard it is to stand up, what hardware it targets, what model formats it eats, how it handles concurrency, and the situation it is built for.
| Runtime | Setup | Target hardware | Model formats | Concurrency | Best fit |
|---|---|---|---|---|---|
| Ollama | One command | Mac (MLX) / CUDA / CPU | GGUF (auto-pull) | Low–moderate (native queue) | Solo dev, prototyping, any OS |
| LM Studio | GUI install | Mac, Windows, Linux | GGUF + MLX | Moderate (continuous batching) | Desktop users + small-team server |
| llama.cpp | CLI flags | Everything (Metal/CUDA/ROCm/Vulkan/CPU) | GGUF | Low–moderate (--parallel N) | Max hardware coverage, edge, air-gapped |
| vLLM | Docker / Python | NVIDIA (primary), AMD, TPU | HF native + GGUF + FP8 | Very high (PagedAttention) | Production multi-user serving, multi-GPU |
| mlx-lm | pip + CLI | Apple Silicon only (M1+) | MLX | CLI (no built-in server) | Fine-tuning + max Apple throughput |
Read the matrix as a gradient of intent. Ollama and LM Studio optimize for “running a model in the next five minutes.” llama.cpp optimizes for “running on hardware nobody else supports,” from a Raspberry Pi to an air-gapped server. vLLM optimizes for “serving the model to a hundred people at once.” mlx-lm optimizes for “wringing the most out of a Mac, and fine-tuning on it.” The columns that flip your answer fastest are concurrency and target hardware — everything else is preference.
04 — Apple SiliconOn a Mac, Ollama is no longer just a wrapper.
For two years the standard mental model was “Ollama is llama.cpp with a nice CLI.” On Apple Silicon, that stopped being true in early 2026. Ollama added an MLX backend in preview, then promoted it to stable mid-year, replacing the llama.cpp Metal path on M-series Macs. In practice, today Ollama is MLX on a Mac — which collapses the old “Ollama versus MLX” framing that most 2025 comparisons still use.
The payoff, per Ollama’s own benchmark on an M5 Max running a Qwen 3.5 mixture-of-experts model, was a roughly doubled decode rate and a prefill lift of more than half versus the previous Metal backend. Those figures are vendor-stated and not yet widely replicated, so treat the magnitude as directional rather than exact — but the direction is clear, and it reflects a broader shift: Apple Silicon is now a legitimate inference platform, not a curiosity.
M5 vs M4 (Qwen3-14B-4bit)
Apple's MLX research measured up to a 4x faster time-to-first-token on M5 versus M4, driven by the M5's Neural Accelerators. Prompt-heavy and agentic workloads — long inputs, short outputs — benefit most.
Token generation, M5 vs M4
Generation speed improves a more modest 1.19-1.27x across models. Decode is memory-bandwidth-bound, so it tracks the bandwidth gain far more closely than it tracks raw compute.
M5 vs M4's 120 GB/s
A 28% bandwidth lift. Because local decode is bandwidth-bound, this is the number that most directly caps how fast a given model generates tokens on a Mac — more than core count or clock.
The bandwidth point is the one to internalize, because it governs the whole local-on-Mac economics. If you are weighing a high-memory Mac against a cloud subscription, the trade is real but not automatic — we walk through it in our piece on calculating the ROI of local AI versus cloud subscriptions. The short version: unified memory lets a Mac load models a consumer GPU cannot, but bandwidth, not capacity, sets how fast they run.
05 — GGUF vs MLXThe format choice matters as much as the runtime.
On Apple Silicon you are really making two decisions at once: which runtime, and which model format. GGUF — the llama.cpp lineage — is the universal, cross-platform option; it runs everywhere and is what Ollama auto-fetches. MLX is Apple-native and only runs on M-series chips. The two are not interchangeable files, and the choice has measurable consequences.
One third-party benchmark found MLX models a few percent smaller on disk and modestly faster on M-series hardware than their GGUF equivalents — for example, Llama 3.1 8B measured at 4.92 GB in GGUF versus 4.53 GB in MLX, and Qwen 2.5 32B at 19.8 GB versus 17.9 GB. Because that comparison is single-source, treat the throughput edge as directional, not a guaranteed number. Quality is close at 4-bit; the same write-up suggests GGUF’s Q4_K_M holds quality marginally better below 8B. If you want the full picture on what each quantization level costs you, choosing the right quantization level is its own decision worth getting right.
06 — VRAM MathHow big a model actually fits.
No runtime can run a model that does not fit in memory, so the hardest constraint is rarely speed — it is VRAM (or, on a Mac, unified memory). Weight memory scales almost linearly: estimate it as parameters × bytes-per-weight, where FP16 is 2.0 bytes, Q8_0 is about 1.0, and Q4_K_M is about 0.6 bytes per weight. The cheat sheet below applies that formula straight down the model ladder.
| Model size | FP16 (×2.0 B) | Q8_0 (×1.0 B) | Q4_K_M (×0.6 B) | Comfortable VRAM/RAM |
|---|---|---|---|---|
| 1B | 2.0 GB | 1.0 GB | 0.6 GB | 2 GB |
| 3B | 6.0 GB | 3.0 GB | 1.8 GB | 4 GB |
| 7B | 14.0 GB | 7.0 GB | 4.2 GB | 8 GB |
| 13B | 26.0 GB | 13.0 GB | 7.8 GB | 16 GB |
| 30B | 60.0 GB | 30.0 GB | 18.0 GB | 24 GB |
| 70B | 140.0 GB | 70.0 GB | 42.0 GB | 48 GB+ |
The Q4_K_M column is the one most people live in: a 7B fits in roughly 4 GB of weights, a 13B in under 8 GB, and a 70B in the 38-48 GB band that independent guides converge on. The “comfortable” column adds the KV-cache and runtime headroom you need in practice — which is why a 7B that weighs 4.2 GB still wants an 8 GB card, and why a 70B realistically needs 48 GB before you give it real context.
07 — ProductionvLLM earns its complexity exactly once.
vLLM is the only runtime here that asks for Docker or a Python serving stack, and it is the only one that rewards the effort with serious production capability. Its signature is PagedAttention, which stores KV-cache tensors in non-contiguous, virtual-memory-style paged blocks instead of one pre-allocated contiguous chunk per sequence. That eliminates the memory fragmentation that throttles naive servers, and it is the mechanism behind the peak-throughput numbers from Section 01.
Two more capabilities make it a genuine SaaS backend rather than a faster Ollama. It supports multi-LoRA serving — hundreds of fine-tuned adapters can share one base model with per-request adapter switching and no reload latency, which is exactly what you want when every customer has their own fine-tune. And it covers 200+ model architectures while speaking OpenAI, the Anthropic Messages API, and gRPC, so it slots into most existing client code. SGLang is a credible competitor that can edge vLLM on shared-prefix workloads like RAG in some benchmarks, but for raw model variety and the fastest path to a running server, vLLM is the more common default.
Many users, one endpoint
Serving a model to a real user base, hosting multiple fine-tunes behind one API, or scaling across multiple GPUs. PagedAttention and continuous batching are what you are paying the ops cost for — and they pay back under concurrency.
Single-user, single machine
On a laptop or a one-developer workflow, vLLM's Docker and NVIDIA-first posture is overhead with no payoff — the throughput edge only appears under concurrency you do not have. Ollama or LM Studio will feel identical and start faster.
The headline cluster numbers
Vendor figures like 2,200 tok/s per H200 for a frontier MoE come from enterprise GPUs that cost five figures each and specific parallelism configs. Real-world latency varies with batch size — useful as a ceiling, not a budget input.
The honest framing: vLLM is a production tool that happens to run locally, not a local tool that happens to scale. If the workload is a multi-user app, it is worth every line of the compose file — and the crossover point where self-hosting beats per-token API pricing is a real calculation, which we lay out in our look at when cloud inference is actually cheaper than local GPU hardware. If it is one developer and one model, the simpler runtimes win on every axis that matters to you.
08 — Make The CallMatch the runtime to your situation.
Strip away the benchmarks and the decision is mostly about who you are and what you are building. Here is the fast read by situation — and the privacy case for going local at all is covered in our guide to the privacy benefits of local inference if you still need to make it.
Fastest time to a running model
One command, sensible defaults, an OpenAI-compatible API, and an MLX backend on Mac you get for free. Start here; you can always graduate later.
A GUI plus a headless option
Run GGUF and MLX side by side, manage models visually, then flip on the llmster daemon for CI/CD or a small-team server. Free for commercial use, so there is no licensing objection.
Run anywhere, depend on nothing
Pure C/C++, no runtime dependencies, and the broadest backend list in the field — Metal, CUDA, ROCm, Vulkan, SYCL, plain CPU. The right tool for embedded, offline, and hardware nobody else supports.
Serve many, fine-tune many
PagedAttention throughput, multi-LoRA hosting, multi-GPU scaling, and OpenAI plus Anthropic Messages compatibility. Accept the Docker and NVIDIA-first reality in exchange for real serving capability.
Maximum Apple Silicon throughput
The native path when you are fine-tuning or squeezing every token-per-second out of an M-series chip. No built-in server, so wrap it with Ollama or LM Studio when you need an endpoint.
Most teams end up running two of these, not one: a friendly tool for local development and a production server for the deployed app. If you are standing up local or self-hosted inference as part of a broader build, our AI digital transformation engagements start with exactly this kind of runtime-and-hardware decision, and our web development work wires the resulting endpoint into a real product.
09 — ConclusionThe right runtime depends on your chip, not your preference.
There is no best local LLM runtime — only the right one for your load.
The five runtimes are not competitors so much as specialists. Ollama and LM Studio own the desktop. llama.cpp is the universal engine underneath them and the only realistic option on exotic or air-gapped hardware. vLLM is the production server that earns its complexity the moment you have real concurrency. mlx-lm is the Apple Silicon throughput-and-fine-tuning path. The fastest way to choose badly is to read one benchmark headline and pick the “winner.”
That 19x throughput gap is the cleanest example. It is real, and it is also irrelevant to most people who run a model locally, because it only opens under concurrency a single developer never generates. The same trap waits in every vendor tok/s figure and every enterprise-GPU number: true in its context, misleading out of it.
The practical move is to size the model to your silicon first, pick the runtime that matches your concurrency and your chip second, and benchmark on your own prompts before trusting anyone’s headline — including this one. Local inference in 2026 is genuinely good enough to build on. The work is matching the tool to the job, and the matrix above is where that starts.