Run a local LLM in 2026 and the first real decision is not which model — it is which runtime. Ollama, LM Studio, llama.cpp, vLLM, and MLX are the five engines worth knowing, and they are not interchangeable. One is a one-command desktop tool; another saturates eight NVIDIA GPUs under production load; a third only exists on Apple Silicon. Pick wrong and you either fight an over-engineered server on a laptop or starve a multi-user app of throughput.

The headline numbers do not help. A widely-cited Red Hat benchmark clocked vLLM at 793 tokens per second versus Ollama’s 41 on the same A100 — a 19x gap that gets quoted endlessly. What rarely travels with it is the caveat: at a single user, the two finish in a near dead heat. The gap is real, but it lives almost entirely under concurrency. Most guides lead with the 19x and bury the context.

This guide compares all five runtimes the way you actually choose between them — by chip, by concurrency, and by how much you want to operate. You get a five-runtime decision matrix, the Apple-Silicon format question (GGUF versus MLX) that most comparisons skip, a VRAM cheat sheet to map your hardware to a model size, and a clear read on when vLLM’s production muscle is worth the Docker. Every figure is traced to its source, and the vendor-stated ones are flagged as such.

Key takeaways

01
Five runtimes, five distinct jobs.Ollama is the one-command solo-dev default; LM Studio is a GUI plus a headless daemon; llama.cpp is the universal engine and edge option; vLLM is the production multi-user server; mlx-lm is the Apple Silicon fine-tuning and throughput path.
02
The 19x throughput gap is a concurrency story.Red Hat measured vLLM at 793 tok/s versus Ollama at 41 at peak load, but the two are roughly tied (~130-180 tok/s) at a single user. vLLM's advantage is serving many requests at once, not raw single-stream speed.
03
On a Mac, Ollama is no longer just a llama.cpp wrapper.Since early 2026 Ollama runs an MLX backend on Apple Silicon instead of the old llama.cpp Metal path. Ollama's own benchmark reports decode throughput roughly doubling — vendor-stated, but directionally real.
04
On Apple Silicon, the model format matters too.MLX models run a touch smaller on disk and modestly faster than GGUF on M-series chips, per a single third-party benchmark. You choose a runtime and a format together, not separately.
05
VRAM is the hard constraint, not speed.A 70B model at Q4_K_M needs roughly 38-48 GB of weights before any context, and its KV cache alone can grow from ~1.6 GB at 2K tokens to 42+ GB at 128K. Size the model to the silicon you have.

01 — The 19x GapThe 19x gap is a concurrency story, not a speed story.

The most-quoted number in the local-LLM world comes from a Red Hat Developer benchmark that pitted vLLM against Ollama on a single A100-PCIE-40GB running Llama 3.1-8B-Instruct, measured with GuideLLM over 300-second runs. At peak load, vLLM hit 793 tokens per second against Ollama’s 41 — the 19x gap everyone repeats.

The number that almost never travels with it: at a single user, both tools land in the same 130-180 tok/s band. The chart below shows what actually happens as you add concurrency. At eight simultaneous users, vLLM serves 187 tok/s to Ollama’s 82 (tuned), and tail latency tells the same story — vLLM’s P99 stays near 80 ms at peak while Ollama’s climbs to 673 ms. The mechanism is structural: vLLM’s PagedAttention and continuous batching pack many in-flight requests onto the GPU, whereas Ollama processes a near-serial queue.

Throughput by load · vLLM vs Ollama (tokens/sec)

Source: Red Hat Developer benchmark, Aug 2025 — GuideLLM on A100-PCIE-40GB, Llama 3.1-8B

Single user — both toolsOllama ≈ vLLM, 130-180 tok/s — roughly tied

≈155

vLLM — 8 concurrent userscontinuous batching engaged

187

Ollama — 8 concurrent users (tuned)near-serial request queue

vLLM — peak throughputsaturated with concurrent requests

793

Ollama — peak throughput (default)same A100, same model

vLLMOllama

The independent read

Red Hat’s own conclusion was blunt: vLLM is the superior choice for production deployment, handling enterprise-grade, high-concurrency applications with efficient resource management. The crucial qualifier, in their own data, is high-concurrency. For a single developer running one model on one machine, the 19x gap collapses to near parity — which is exactly why most people running a model locally never feel it.

The 19x gap is real — but it lives entirely at peak concurrency. At one user, Ollama and vLLM finish in a near dead heat.— Digital Applied, editorial synthesis of the Red Hat benchmark

02 — The FieldFive runtimes, each built for a different job.

Before the matrix, the cast. These five tools dominate local inference in 2026, and the fastest way to choose badly is to treat them as five flavors of the same thing. They are not — one is even the literal engine inside two of the others.

Quick start

Ollama

Go + llama.cpp · 175K★

One command to pull and run a model, with an OpenAI-compatible API at localhost:11434/v1. On Apple Silicon it now runs an MLX backend, not just llama.cpp. The right first stop for solo devs and prototyping on any OS.

ollama.com

Desktop + daemon

LM Studio

GUI · GGUF + MLX · free for work

A polished GUI that runs GGUF and MLX models side by side, plus llmster — a headless daemon for CI/CD and small-team servers, driven by the lms CLI. Free for personal and commercial use since July 2025.

lmstudio.ai

The engine

llama.cpp

pure C/C++ · 119K★

The inference engine underneath Ollama and LM Studio's GGUF path. No runtime dependencies, the widest hardware backend list in the field (Metal, CUDA, ROCm, Vulkan, SYCL, CPU), and quantization from 1.5-bit to 8-bit.

github.com/ggml-org/llama.cpp

Production server

vLLM

PagedAttention · 200+ models

Built for high-concurrency serving on NVIDIA (with AMD and TPU support). PagedAttention plus continuous batching produce the peak-throughput numbers. Speaks OpenAI, Anthropic Messages, and gRPC. Originated at UC Berkeley's Sky Computing Lab.

docs.vllm.ai

Apple only

MLX / mlx-lm

Apple Silicon M1+ · pip install

Apple's own array framework for M-series chips. No built-in server, but the fastest path to fine-tuning and maximum throughput on a Mac, with thousands of pre-quantized models on the mlx-community hub. Does not run on Intel Macs.

github.com/ml-explore/mlx-lm

One practical thread ties four of the five together: an OpenAI-compatible endpoint. Ollama, LM Studio, llama.cpp’s llama-server, and vLLM all expose OpenAI-style routes, so most client code that targets the OpenAI SDK can repoint at localhost with a one-line base-URL change. The exception is mlx-lm, which ships no server of its own and is usually wrapped by Ollama or LM Studio. Watch one edge: Ollama’s compatibility layer omits a few fields the real API supports — logprobs, tool-choice, and logit-bias among them — so apps that lean on those features need to account for the gap. If you are still deciding whether to self-host open-weight models at all, settle that first; this post assumes you have.

LM Studio also pairs with LM Link to drive models from a phone — we cover that separately in our guide to running your largest local models from your iPhone via LM Link. It is a genuinely different use case from the runtime-selection question here, so we will not rehash it.

03 — Decision MatrixOne matrix, five runtimes, the dimensions that decide it.

Most comparisons cover two or three of these tools and skip llama.cpp-as-a-standalone and mlx-lm entirely. Here is the full field across the axes that actually drive the choice: how hard it is to stand up, what hardware it targets, what model formats it eats, how it handles concurrency, and the situation it is built for.

Five local LLM runtimes — Ollama, LM Studio, llama.cpp, vLLM, and mlx-lm — compared across setup effort, target hardware, model formats, concurrency model, and best-fit scenario.
Runtime	Setup	Target hardware	Model formats	Concurrency	Best fit
Ollama	One command	Mac (MLX) / CUDA / CPU	GGUF (auto-pull)	Low–moderate (native queue)	Solo dev, prototyping, any OS
LM Studio	GUI install	Mac, Windows, Linux	GGUF + MLX	Moderate (continuous batching)	Desktop users + small-team server
llama.cpp	CLI flags	Everything (Metal/CUDA/ROCm/Vulkan/CPU)	GGUF	Low–moderate (--parallel N)	Max hardware coverage, edge, air-gapped
vLLM	Docker / Python	NVIDIA (primary), AMD, TPU	HF native + GGUF + FP8	Very high (PagedAttention)	Production multi-user serving, multi-GPU
mlx-lm	pip + CLI	Apple Silicon only (M1+)	MLX	CLI (no built-in server)	Fine-tuning + max Apple throughput

Read the matrix as a gradient of intent. Ollama and LM Studio optimize for “running a model in the next five minutes.” llama.cpp optimizes for “running on hardware nobody else supports,” from a Raspberry Pi to an air-gapped server. vLLM optimizes for “serving the model to a hundred people at once.” mlx-lm optimizes for “wringing the most out of a Mac, and fine-tuning on it.” The columns that flip your answer fastest are concurrency and target hardware — everything else is preference.

04 — Apple SiliconOn a Mac, Ollama is no longer just a wrapper.

For two years the standard mental model was “Ollama is llama.cpp with a nice CLI.” On Apple Silicon, that stopped being true in early 2026. Ollama added an MLX backend in preview, then promoted it to stable mid-year, replacing the llama.cpp Metal path on M-series Macs. In practice, today Ollama is MLX on a Mac — which collapses the old “Ollama versus MLX” framing that most 2025 comparisons still use.

The payoff, per Ollama’s own benchmark on an M5 Max running a Qwen 3.5 mixture-of-experts model, was a roughly doubled decode rate and a prefill lift of more than half versus the previous Metal backend. Those figures are vendor-stated and not yet widely replicated, so treat the magnitude as directional rather than exact — but the direction is clear, and it reflects a broader shift: Apple Silicon is now a legitimate inference platform, not a curiosity.

Why Apple Silicon got serious

Apple Machine Learning Research reports that on MLX it measured up to a 4x speedup in time-to-first-token on the M5 versus an M4 baseline, attributed to the M5’s Neural Accelerators. Decode speed — the token-generation rate you feel during a long answer — improves a more modest 1.19x to 1.27x, because decode is bound by memory bandwidth, and the M5’s 153 GB/s is only about 28% above the M4’s 120 GB/s.

TTFT speedup

M5 vs M4 (Qwen3-14B-4bit)

4.06×

Apple's MLX research measured up to a 4x faster time-to-first-token on M5 versus M4, driven by the M5's Neural Accelerators. Prompt-heavy and agentic workloads — long inputs, short outputs — benefit most.

Apple-reported

Decode speedup

Token generation, M5 vs M4

1.2×

Generation speed improves a more modest 1.19-1.27x across models. Decode is memory-bandwidth-bound, so it tracks the bandwidth gain far more closely than it tracks raw compute.

Apple-reported

Memory bandwidth

M5 vs M4's 120 GB/s

153GB/s

A 28% bandwidth lift. Because local decode is bandwidth-bound, this is the number that most directly caps how fast a given model generates tokens on a Mac — more than core count or clock.

+28% vs M4

The bandwidth point is the one to internalize, because it governs the whole local-on-Mac economics. If you are weighing a high-memory Mac against a cloud subscription, the trade is real but not automatic — we walk through it in our piece on calculating the ROI of local AI versus cloud subscriptions. The short version: unified memory lets a Mac load models a consumer GPU cannot, but bandwidth, not capacity, sets how fast they run.

05 — GGUF vs MLXThe format choice matters as much as the runtime.

On Apple Silicon you are really making two decisions at once: which runtime, and which model format. GGUF — the llama.cpp lineage — is the universal, cross-platform option; it runs everywhere and is what Ollama auto-fetches. MLX is Apple-native and only runs on M-series chips. The two are not interchangeable files, and the choice has measurable consequences.

One third-party benchmark found MLX models a few percent smaller on disk and modestly faster on M-series hardware than their GGUF equivalents — for example, Llama 3.1 8B measured at 4.92 GB in GGUF versus 4.53 GB in MLX, and Qwen 2.5 32B at 19.8 GB versus 17.9 GB. Because that comparison is single-source, treat the throughput edge as directional, not a guaranteed number. Quality is close at 4-bit; the same write-up suggests GGUF’s Q4_K_M holds quality marginally better below 8B. If you want the full picture on what each quantization level costs you, choosing the right quantization level is its own decision worth getting right.

The Apple-Silicon rule of thumb

If you are on an M-series Mac, default to MLX for the best speed and the smallest footprint — Ollama now picks it automatically, and LM Studio lets you run GGUF and MLX side by side in one session. Keep a GGUF copy of any model you also need to run on a non-Apple box, so the same weights travel across your fleet without a re-download.

06 — VRAM MathHow big a model actually fits.

No runtime can run a model that does not fit in memory, so the hardest constraint is rarely speed — it is VRAM (or, on a Mac, unified memory). Weight memory scales almost linearly: estimate it as parameters × bytes-per-weight, where FP16 is 2.0 bytes, Q8_0 is about 1.0, and Q4_K_M is about 0.6 bytes per weight. The cheat sheet below applies that formula straight down the model ladder.

Approximate weight memory in gigabytes by model size and quantization, computed as parameters times bytes-per-weight, with a practical VRAM or RAM recommendation that adds KV-cache and runtime overhead.
Model size	FP16 (×2.0 B)	Q8_0 (×1.0 B)	Q4_K_M (×0.6 B)	Comfortable VRAM/RAM
1B	2.0 GB	1.0 GB	0.6 GB	2 GB
3B	6.0 GB	3.0 GB	1.8 GB	4 GB
7B	14.0 GB	7.0 GB	4.2 GB	8 GB
13B	26.0 GB	13.0 GB	7.8 GB	16 GB
30B	60.0 GB	30.0 GB	18.0 GB	24 GB
70B	140.0 GB	70.0 GB	42.0 GB	48 GB+

The Q4_K_M column is the one most people live in: a 7B fits in roughly 4 GB of weights, a 13B in under 8 GB, and a 70B in the 38-48 GB band that independent guides converge on. The “comfortable” column adds the KV-cache and runtime headroom you need in practice — which is why a 7B that weighs 4.2 GB still wants an 8 GB card, and why a 70B realistically needs 48 GB before you give it real context.

The number weights don't show

Weights are only half the memory story. A 70B model’s KV cache alone can grow from about 1.6 GB at a 2K-token context to 42+ GB at 128K — large enough to rival the weights themselves. On long-context and agentic workloads, the context window, not the parameter count, is often what decides whether the model fits. If you are building on-device AI agents and the local inference stack around it, budget the KV cache first.

07 — ProductionvLLM earns its complexity exactly once.

vLLM is the only runtime here that asks for Docker or a Python serving stack, and it is the only one that rewards the effort with serious production capability. Its signature is PagedAttention, which stores KV-cache tensors in non-contiguous, virtual-memory-style paged blocks instead of one pre-allocated contiguous chunk per sequence. That eliminates the memory fragmentation that throttles naive servers, and it is the mechanism behind the peak-throughput numbers from Section 01.

Two more capabilities make it a genuine SaaS backend rather than a faster Ollama. It supports multi-LoRA serving — hundreds of fine-tuned adapters can share one base model with per-request adapter switching and no reload latency, which is exactly what you want when every customer has their own fine-tune. And it covers 200+ model architectures while speaking OpenAI, the Anthropic Messages API, and gRPC, so it slots into most existing client code. SGLang is a credible competitor that can edge vLLM on shared-prefix workloads like RAG in some benchmarks, but for raw model variety and the fastest path to a running server, vLLM is the more common default.

Reach for vLLM

Many users, one endpoint

Serving a model to a real user base, hosting multiple fine-tunes behind one API, or scaling across multiple GPUs. PagedAttention and continuous batching are what you are paying the ops cost for — and they pay back under concurrency.

Pick vLLM

Skip vLLM

Single-user, single machine

On a laptop or a one-developer workflow, vLLM's Docker and NVIDIA-first posture is overhead with no payoff — the throughput edge only appears under concurrency you do not have. Ollama or LM Studio will feel identical and start faster.

Use Ollama / LM Studio

Enterprise context

The headline cluster numbers

Vendor figures like 2,200 tok/s per H200 for a frontier MoE come from enterprise GPUs that cost five figures each and specific parallelism configs. Real-world latency varies with batch size — useful as a ceiling, not a budget input.

Treat as ceiling, not norm

The honest framing: vLLM is a production tool that happens to run locally, not a local tool that happens to scale. If the workload is a multi-user app, it is worth every line of the compose file — and the crossover point where self-hosting beats per-token API pricing is a real calculation, which we lay out in our look at when cloud inference is actually cheaper than local GPU hardware. If it is one developer and one model, the simpler runtimes win on every axis that matters to you.

08 — Make The CallMatch the runtime to your situation.

Strip away the benchmarks and the decision is mostly about who you are and what you are building. Here is the fast read by situation — and the privacy case for going local at all is covered in our guide to the privacy benefits of local inference if you still need to make it.

Solo dev, any laptop

Fastest time to a running model

One command, sensible defaults, an OpenAI-compatible API, and an MLX backend on Mac you get for free. Start here; you can always graduate later.

Pick Ollama

Mac power user

A GUI plus a headless option

Run GGUF and MLX side by side, manage models visually, then flip on the llmster daemon for CI/CD or a small-team server. Free for commercial use, so there is no licensing objection.

Pick LM Studio

Edge / air-gapped / exotic HW

Run anywhere, depend on nothing

Pure C/C++, no runtime dependencies, and the broadest backend list in the field — Metal, CUDA, ROCm, Vulkan, SYCL, plain CPU. The right tool for embedded, offline, and hardware nobody else supports.

Pick llama.cpp

Production multi-user app

Serve many, fine-tune many

PagedAttention throughput, multi-LoRA hosting, multi-GPU scaling, and OpenAI plus Anthropic Messages compatibility. Accept the Docker and NVIDIA-first reality in exchange for real serving capability.

Pick vLLM

Fine-tuning on a Mac

Maximum Apple Silicon throughput

The native path when you are fine-tuning or squeezing every token-per-second out of an M-series chip. No built-in server, so wrap it with Ollama or LM Studio when you need an endpoint.

Pick mlx-lm

Most teams end up running two of these, not one: a friendly tool for local development and a production server for the deployed app. If you are standing up local or self-hosted inference as part of a broader build, our AI digital transformation engagements start with exactly this kind of runtime-and-hardware decision, and our web development work wires the resulting endpoint into a real product.

09 — ConclusionThe right runtime depends on your chip, not your preference.

The shape of local inference, June 2026

There is no best local LLM runtime — only the right one for your load.

The five runtimes are not competitors so much as specialists. Ollama and LM Studio own the desktop. llama.cpp is the universal engine underneath them and the only realistic option on exotic or air-gapped hardware. vLLM is the production server that earns its complexity the moment you have real concurrency. mlx-lm is the Apple Silicon throughput-and-fine-tuning path. The fastest way to choose badly is to read one benchmark headline and pick the “winner.”

That 19x throughput gap is the cleanest example. It is real, and it is also irrelevant to most people who run a model locally, because it only opens under concurrency a single developer never generates. The same trap waits in every vendor tok/s figure and every enterprise-GPU number: true in its context, misleading out of it.

The practical move is to size the model to your silicon first, pick the runtime that matches your concurrency and your chip second, and benchmark on your own prompts before trusting anyone’s headline — including this one. Local inference in 2026 is genuinely good enough to build on. The work is matching the tool to the job, and the matrix above is where that starts.

Run Local LLMs in 2026: Ollama vs LM Studio vs vLLM