Choosing between GGUF, AWQ, GPTQ, EXL2 and MLX is the most common LLM quantization decision developers get wrong — and picking the wrong format can, by some estimates, cost a meaningful share of inference speed before you have tuned a single parameter. The formats are not interchangeable: each targets different hardware and trades accuracy against throughput in its own way. This guide is the map.
The confusion starts with a category error. GGUF is a self-contained file format — weights, tokenizer, and metadata bundled into one portable artifact. GPTQ and AWQ are quantization algorithms whose output is stored as ordinary Hugging Face safetensors. EXL2 and MLX are formats again, each welded to a single runtime. Treat them as one menu and you will pair a format with hardware it was never built for.
Below: a plain-English breakdown of all six options (GGUF, GPTQ, AWQ, EXL2, MLX and bitsandbytes), a table mapping each to its runtime and hardware fit, the counterintuitive benchmark data from a January 2026 evaluation, and a decision tree for CPU, NVIDIA, and Apple Silicon. If you want the orthogonal question of how many bits to keep, our companion piece on 4-bit vs 8-bit vs FP8 tradeoffs covers the precision dimension this post deliberately sets aside.
- 01 Format and algorithm are not the same thing. GGUF is a self-contained container; GPTQ and AWQ are algorithms whose weights are stored as Hugging Face safetensors; EXL2 and MLX are formats bound to one runtime each. Pick on that axis before you pick a bit-width.
- 02 GGUF is the universal default. It runs on CPU, NVIDIA, AMD and Apple Silicon, can split layers across CPU and GPU, and its k-quants match GPTQ-4 quality at the same average bit-width. The safe choice when hardware is mixed or unknown.
- 03 On NVIDIA, AWQ has overtaken GPTQ. AWQ's activation-aware scaling cut the INT4 quantization penalty from 4.57 to 1.17 perplexity (about 74%) and it outperforms GPTQ on reasoning at the same 4 bits. New models now ship AWQ or GGUF first.
- 04 EXL2 dials bitrate to your VRAM; MLX owns the Mac. EXL2 targets any fractional average between 2 and 8 bits per weight on a single NVIDIA GPU; MLX exploits Apple unified memory and is now the default Apple-Silicon backend in Ollama.
- 05 bitsandbytes is the only training-safe format here. Its 4-bit NF4 path powers QLoRA fine-tuning by freezing a compressed base and training small adapters. The other formats in this comparison are built for inference, with mlx-lm's adapter training on Apple Silicon as the partial exception.
01 — Format vs Algorithm
The category error almost every comparison makes.
Before comparing speeds, fix the vocabulary. People say “quantization format” to mean three different kinds of thing, and that conflation is exactly how someone ends up with an EXL2 file they cannot load on a Mac, or expects a GGUF to run inside vLLM. There are three buckets.
A file format is a complete, self-contained artifact: the file you download is the model. GGUF, EXL2 and MLX are file formats, each tied to a specific runtime. A quantization algorithm is a procedure, not a file type — GPTQ and AWQ produce ordinary Hugging Face safetensors shards plus a config, which is why an “AWQ model” looks like any other Hugging Face repository. And on-the-fly quantization, the bitsandbytes approach, ships no pre-quantized file at all: weights are compressed at load time and decompressed during the forward pass.
File formats
The downloaded file is the artifact — weights, tokenizer and quant parameters bundled together, each tied to a specific runtime (llama.cpp, ExLlamaV2, mlx-lm).
Algorithms
A quantization procedure, not a file type. The output is standard Hugging Face safetensors shards plus a config — so the repo looks like any other HF model.
On-the-fly
No pre-quantized file at all. Weights are compressed at load time and decompressed per forward pass, which is what makes QLoRA fine-tuning possible.
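The three buckets can be told apart mechanically from what a download contains. The sketch below is a minimal illustration: `classify_artifact` is a hypothetical helper, and it assumes GPTQ and AWQ exports record their method under `quantization_config.quant_method` in `config.json`, which is how Hugging Face exports commonly look but is worth checking on the repository in question.

```python
import json


def classify_artifact(filenames, config_json=None):
    """Sort a model download into one of the three buckets described above.

    `filenames` is the list of file names in the download and `config_json`
    is the text of its config.json, if present. Both are hypothetical inputs.
    """
    # A .gguf file is the whole model: weights, tokenizer and metadata.
    if any(name.lower().endswith(".gguf") for name in filenames):
        return "file format (GGUF)"

    config = json.loads(config_json) if config_json else {}
    # Assumption: GPTQ/AWQ exports store the method name in
    # quantization_config.quant_method inside config.json.
    method = config.get("quantization_config", {}).get("quant_method", "")
    if method in ("gptq", "awq"):
        return f"algorithm ({method.upper()}) stored as safetensors"

    # No pre-quantized marker: plain weights that an on-the-fly loader
    # such as bitsandbytes could still compress at load time.
    return "no pre-quantized weights detected"


print(classify_artifact(["llama-q4_k_m.gguf"]))
print(classify_artifact(
    ["model.safetensors", "config.json"],
    '{"quantization_config": {"quant_method": "awq"}}',
))
```

EXL2 and MLX downloads are deliberately left out of this sketch, because identifying them reliably depends on runtime-specific metadata not covered here.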
02 — GGUF
GGUF: the format that runs everywhere.
GGUF is often expanded as “Georgi Gerganov Universal Format,” after the creator of llama.cpp and ggml, though that reading is informal and other expansions circulate. It replaced the older GGML format in August 2023, a distinction worth keeping straight, because GGML is the predecessor and GGUF is the current standard. A single GGUF file packages weights, tokenizer, metadata and quantization parameters into one portable artifact.
Its defining feature is hardware-agnosticism. GGUF runs on the CPU via llama.cpp, on NVIDIA GPUs through CUDA, on AMD through ROCm, and on Apple Silicon through Metal — and it can split a model’s layers between CPU and GPU, a trick called partial offload that lets you run a model slightly too large for your VRAM. No other format in this comparison spans that many targets.
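Partial offload comes down to a budget calculation: how many layers fit in the VRAM you can spare. The sketch below is a rough estimate under stated assumptions: every layer is treated as the same size, embeddings and the KV cache are folded into a single `reserve_bytes` figure, and `layers_that_fit` is a hypothetical helper. The resulting count is the kind of number you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option.

```python
GIB = 2 ** 30


def layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes=0):
    """Estimate how many transformer layers can be offloaded to the GPU.

    Assumes all layers are equally sized; `reserve_bytes` stands in for the
    KV cache, scratch buffers and anything else that must stay in VRAM.
    """
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))


# A 10 GiB model with 32 layers on an 8 GiB card, keeping 2 GiB free.
print(layers_that_fit(10 * GIB, 32, 8 * GIB, reserve_bytes=2 * GIB))
```

The remaining layers stay on the CPU, which is why a model slightly larger than VRAM still runs, only more slowly than a fully offloaded one.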
On the quality side, llama.cpp’s K-quants use a super-block structure: weights are grouped into super-blocks, each split into smaller blocks that carry their own scaling factor. The S/M/L suffix does not describe block size; it marks the Small, Medium and Large mixes, which keep progressively more of the sensitive tensors at a higher-precision quant type, so file size and accuracy both rise from S to L. K-quants consistently beat the older Q4_0 and Q5_0 formats at the same average bit-width. The table below uses independent perplexity and benchmark measurements from a January 2026 arXiv evaluation of llama.cpp quantization on Llama-3.1-8B-Instruct.
| GGUF type | Nominal bits | Size reduction vs F16 | Perplexity | Aggregate score | GSM8K | Use when |
|---|---|---|---|---|---|---|
| Reference & near-lossless | ||||||
| F16 (baseline) | 16 | 0% | 7.32 | 69.47 | 77.63 | Reference only — too large to ship |
| Q8_0 | 8 | 46.9% | 7.33 | 69.41 | 77.48 | Near-lossless quality needed |
| The sweet spot | ||||||
| Q5_0 | 5 | 65.2% | 7.43 | 69.92 | 79.08 | Best quality-per-byte on CPU |
| Q4_K_S | 4 | 70.8% | 7.62 | 69.17 | 77.33 | Tight VRAM on GPU |
| Aggressive compression | ||||||
| Q3_K_S | 3 | 77.2% | 8.96 | 65.49 | 68.31 | Maximum compression, clear cost |
Two things about that table are worth a second look. First, the Q4_K_M tier most people actually run is not listed: it sits just above Q4_K_S in bit allocation and is the popular default. It is meaningfully better than the legacy Q4_0 at the same nominal width because attention and embedding layers get a slightly higher bit budget. Second, and more surprising, the 5-bit Q5_0 tier edged out FP16 on this run: 69.92 versus 69.47 on the aggregate score, and 79.08 versus 77.63 on GSM8K math. The paper attributes this to a mild regularization effect from quantization noise. Treat it as a specific, repeatable finding for this model, not a universal rule that 5-bit always beats full precision.
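The size-reduction column in the table can be sanity-checked from the block layouts of the legacy quant types. A short worked calculation follows, assuming the usual llama.cpp layout of 32 weights per block plus a 2-byte FP16 scale (and 4 extra bytes of high bits for Q5_0). Q8_0 lands on the table's 46.9% exactly; Q5_0 comes out slightly higher than the table's 65.2%, plausibly because not every tensor in a real file is stored in the nominal type.

```python
def bits_per_weight(block_bytes, weights_per_block=32):
    """Effective bits per weight for a block-quantized tensor."""
    return block_bytes * 8 / weights_per_block


def size_reduction_vs_f16(bpw):
    """Fraction of the F16 size saved at a given bits-per-weight."""
    return 1 - bpw / 16


# Bytes per 32-weight block (assumed layout: payload + 2-byte FP16 scale).
LEGACY_BLOCK_BYTES = {"Q4_0": 16 + 2, "Q5_0": 16 + 4 + 2, "Q8_0": 32 + 2}

for name, nbytes in LEGACY_BLOCK_BYTES.items():
    bpw = bits_per_weight(nbytes)
    print(f"{name}: {bpw} bpw, {size_reduction_vs_f16(bpw):.1%} smaller than F16")
```

The per-block scale is why a "4-bit" file is really about 4.5 bits per weight, and why nominal bit-width alone understates file size.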
GGUF is the format you load in LM Studio for local model management, in Ollama, or directly through llama.cpp. If your priority is getting a capable model running on whatever machine is in front of you — including a laptop with no discrete GPU — GGUF is almost always the right first move.
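Because a GGUF file is self-describing, a cheap sanity check before handing it to a runtime is to read its fixed header. The sketch below assumes the little-endian version 2 or 3 layout (4-byte magic `GGUF`, a `uint32` version, then two `uint64` counts); `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumed v2/v3, little-endian)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes of header")
    magic = data[:4]
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration; a real check would read 24 bytes from disk.
sample = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(sample))
```

In practice you would call it as `read_gguf_header(open(path, "rb").read(24))`, where `path` is whatever file you downloaded. An older GGML file fails the magic check, which catches the predecessor-format mix-up early.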
03 — GPTQ
GPTQ: the GPU standard that 2025 outgrew.
GPTQ was the method that made single-GPU inference of giant models possible. It can quantize a 175-billion-parameter model in roughly 4 GPU hours to 3 or 4 bits with negligible accuracy degradation, and it more than doubled compression versus the one-shot methods that came before it. For two years that was the GPU quantization story: one prolific Hugging Face contributor, TheBloke, uploaded more than 2,000 GPTQ-quantized models and effectively set the de facto standard for 2023 and 2024.
The runtime story is solid: vLLM serves GPTQ through its Marlin and Machete custom kernels, optimized for Ampere (A100 and up) and Hopper (H100 and up) NVIDIA GPUs. GPTQ also supports extreme compression down to 2-bit and even ternary quantization — but here the paper claims and the production reality diverge. Community experience shows quality degrades meaningfully below 4 bits, so treat sub-4-bit GPTQ as experimental rather than something you ship.
The honest 2026 framing is that GPTQ has been largely superseded by AWQ for new model releases. It still has the largest legacy library and remains a perfectly reasonable choice for existing GPTQ checkpoints, but when a new model drops, it now lands in AWQ or GGUF first.
175B in roughly 4 GPU-hours
GPTQ was the first method to quantize a 175-billion-parameter model to 3 to 4 bits with negligible accuracy loss, enabling single-GPU inference of models that previously needed a cluster.
TheBloke's GPTQ uploads
One prolific Hugging Face contributor published over 2,000 GPTQ-quantized models, which is why GPTQ became the de-facto GPU standard across 2023 and 2024.
Extreme compression, real cost
GPTQ supports 2-bit and even ternary quantization, but community testing shows quality drops sharply below 4 bits — keep production GPTQ at 4-bit.
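The single-GPU claim for a 175-billion-parameter model is easiest to appreciate as arithmetic. The sketch below counts weight storage only, in decimal gigabytes, and ignores activations, the KV cache and per-group scale overhead, so treat the results as lower bounds.

```python
def weight_memory_gb(n_params, bits):
    """Weight storage in decimal GB at a given bit-width (weights only)."""
    return n_params * bits / 8 / 1e9


for bits in (16, 4, 3):
    print(f"175B at {bits}-bit: {weight_memory_gb(175e9, bits):.1f} GB")
```

At 16 bits the weights alone need a multi-GPU node. At 3 bits they drop under the capacity of a single 80 GB accelerator, which is the regime the sub-4-bit results target and also where the quality caveats above start to matter.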
04 — AWQ
AWQ: protect the one percent that matters.
AWQ — Activation-Aware Weight Quantization, from MIT’s Han Lab — is built on a single sharp observation: weights are not equally important. Identifying and protecting just the top 1% of salient weights, selected by activation magnitude rather than by the weight values themselves, sharply reduces quantization error. AWQ does this with per-channel scaling informed by activation patterns, and it needs no backpropagation and no weight reconstruction to do it.
That idea won the Best Paper Award at MLSys 2024, and the numbers back it up. On INT4 with group size 128, AWQ dropped the perplexity penalty from 4.57 to 1.17 — about a 74% reduction in degradation — and at the same 4-bit depth it consistently outperforms GPTQ on reasoning and instruction-following benchmarks. It also runs fast: a reported 3x-plus speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. AWQ models on Hugging Face have passed 6 million downloads.
Practically, AWQ has become the default for production GPU inference. It is natively supported across the major NVIDIA stacks — vLLM, Hugging Face TGI, NVIDIA TensorRT-LLM and LMDeploy — and is primarily designed for NVIDIA and edge or mobile GPU acceleration. One caveat on the marketing line: AWQ’s “lossless” claim refers specifically to vision-language models like VILA. For most text LLMs, AWQ at 4-bit is near-lossless, which is excellent, but not literally lossless — benchmark it on your own prompts before treating quality as a solved problem.
The GPTQ-to-AWQ handoff is the clearest trend in the GPU quantization space. It is not that GPTQ stopped working; it is that AWQ’s activation-aware approach delivers better quality at the same bit budget with broader first-class runtime support, so the ecosystem moved its default. When a format wins on both quality and tooling at once, the rest of the field follows within a release cycle or two — which is exactly what happened across 2025 and 2026.
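The activation-aware idea can be seen in a toy example. The sketch below is not the AWQ implementation: it uses one hand-picked weight group, one salient input channel and a fixed scale `s = 2`, whereas the real method derives per-channel scales from calibration activations. It only shows why scaling a salient weight up before rounding, and scaling its activation down by the same factor, shrinks the output error.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest INT4 over one group; returns dequantized values."""
    step = max(abs(w) for w in weights) / 7
    return [max(-7, min(7, round(w / step))) * step for w in weights]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


w = [0.34, -0.70, 0.12, 0.46]   # one weight group (illustrative values)
x = [10.0, 0.1, 0.2, 0.1]       # channel 0 has a large activation: it is salient
exact = dot(w, x)

# Plain round-to-nearest: the salient weight's rounding error is multiplied by 10.
plain_err = abs(dot(quantize_int4(w), x) - exact)

# Activation-aware scaling: multiply the salient weight by s, divide its input by s.
s = 2.0
w_scaled = [w[0] * s] + w[1:]
x_scaled = [x[0] / s] + x[1:]
awq_err = abs(dot(quantize_int4(w_scaled), x_scaled) - exact)

print(f"plain error: {plain_err:.3f}, scaled error: {awq_err:.3f}")
```

The trick works here because the scaled weight stays below the group maximum, so the quantization step is unchanged while the salient channel's effective rounding error is divided by `s`.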
05 — EXL2
EXL2: dial the bitrate to fit your VRAM.
EXL2, the format used by ExLlamaV2, has a genuinely different idea at its core: instead of committing to a whole number of bits, it lets you target any average bitrate between 2 and 8 bits per weight by mixing quantization levels within a single model. Want 4.65 bits per weight to exactly fill a 24GB card? You can ask for it. The algorithm measures quantization error per matrix and allocates more bits to the layers that are most sensitive to it.
“EXL2 supports 2, 3, 4, 5, 6 and 8-bit quantization, and allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.” (turboderp, ExLlamaV2 author)
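To make the mixed-bitrate idea concrete, here is a greedy sketch of error-driven bit allocation. It is an illustration of the general technique, not ExLlamaV2's actual algorithm: `allocate_bits`, the two matrix names and the error numbers are all made up for the example. Each step upgrades whichever matrix buys the most error reduction per extra bit, until the target average is spent.

```python
def allocate_bits(matrices, target_bpw, levels=(2, 3, 4, 5, 6, 8)):
    """Greedy bit allocation under an average bits-per-weight budget.

    `matrices` maps name -> (n_params, {bits: measured_error}).
    Returns the chosen bit-width per matrix and the achieved average bpw.
    """
    choice = {name: 0 for name in matrices}              # index into `levels`
    total_params = sum(n for n, _ in matrices.values())
    budget_bits = target_bpw * total_params
    used = sum(n * levels[0] for n, _ in matrices.values())

    while True:
        best, best_gain = None, 0.0
        for name, (n, err) in matrices.items():
            i = choice[name]
            if i + 1 >= len(levels):
                continue                                  # already at the top level
            extra = n * (levels[i + 1] - levels[i])
            if used + extra > budget_bits:
                continue                                  # upgrade would bust the budget
            gain = (err[levels[i]] - err[levels[i + 1]]) / extra
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        n, _ = matrices[best]
        i = choice[best]
        used += n * (levels[i + 1] - levels[i])
        choice[best] = i + 1

    return {name: levels[i] for name, i in choice.items()}, used / total_params


# Hypothetical measurements: "attn" is sensitive to quantization, "mlp" is robust.
example = {
    "attn": (100, {2: 8.0, 3: 4.0, 4: 2.0, 5: 1.0, 6: 0.5, 8: 0.1}),
    "mlp": (100, {2: 1.0, 3: 0.6, 4: 0.4, 5: 0.3, 6: 0.25, 8: 0.2}),
}
bits, avg = allocate_bits(example, target_bpw=4.5)
print(bits, avg)
```

With these inputs the sensitive matrix ends up at 6 bits and the robust one at 3, for a fractional average of 4.5 bits per weight.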
The trade-off is reach: EXL2 is NVIDIA-only. It does not support AMD, Apple Silicon, or CPU inference. ExLlamaV2 can also load existing 4-bit GPTQ models, and the recommended production stack pairs it with TabbyAPI as a local API server. Where it earns its keep is raw single-GPU throughput: users report 2 to 5x speed increases versus standard loaders for quantized models on single-GPU setups.
TinyLlama on an RTX 4090
At the small end, EXL2 pushes token generation into the high hundreds. A vendor-stated single-batch figure from the ExLlamaV2 maintainer.
Llama 2 7B on an RTX 4090
A 7B model still clears 250 tokens/sec on one consumer card — competitive with much larger server stacks for interactive use.
Llama 2 70B on one card
Fractional bitrate is what lets a 70B model fit a single 24GB GPU at all — at 2.55 bpw it still generates around 38 tokens/sec.
Read those numbers as a best case, not a guarantee. They are vendor-stated benchmark figures from the ExLlamaV2 maintainer, reflecting optimal single-batch generation; expect lower throughput under multi-user or batched-inference load. EXL2 is the power-user format — the right pick when you have a single NVIDIA card, want the most tokens per second you can get, and are happy to tune bitrate to the megabyte.
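Tuning bitrate to a card starts from simple arithmetic: the bits you can afford per weight are the usable VRAM divided by the parameter count. The sketch below is a rough upper bound; `max_bpw` is a hypothetical helper and the 1.5 GB reserve for cache and buffers is an assumed figure, not a measured one.

```python
def max_bpw(n_params, vram_gb, reserve_gb=0.0):
    """Largest average bits-per-weight whose weights fit in the usable VRAM."""
    usable_bits = (vram_gb - reserve_gb) * 1e9 * 8
    return usable_bits / n_params


# A 70B model on a 24 GB card, with and without headroom for the KV cache.
print(round(max_bpw(70e9, 24), 2))
print(round(max_bpw(70e9, 24, reserve_gb=1.5), 2))
```

With some headroom reserved, the answer lands in the neighborhood of the 2.55 bpw figure quoted above for a 70B model on a single 24GB GPU.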
06 — MLX
MLX: built for the unified memory on every Mac.
MLX is Apple’s open-source machine-learning framework for Apple Silicon, launched in December 2023. It is built around the thing that makes a Mac different: the unified memory architecture, where CPU, GPU and Neural Engine share one physical pool of memory. MLX exploits that with true zero-copy tensor operations and lazy evaluation that fuses operations before execution. There is no copying tensors back and forth across a PCIe bus, because there is no separate VRAM to copy to.
That architecture is why Apple Silicon punches above its weight on large models. An M4 Max delivers up to 128GB of unified memory at 546 GB/s of bandwidth — bandwidth in data-center-GPU territory — which means it can hold and feed models that exceed the VRAM of a comparably priced discrete GPU. The capacity, not just the speed, is the point: you can load a model a discrete card simply could not fit.
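Bandwidth matters because token generation is usually limited by how fast weights can be streamed from memory. A common rule of thumb, offered here as an approximation and not a measurement, is that the decode ceiling equals memory bandwidth divided by the bytes read per token. The sketch assumes every active weight is read once per token and ignores the KV cache and compute time; `decode_ceiling_tok_s` is a hypothetical helper.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params, bits_per_weight):
    """Rough upper bound on decode tokens/sec for a memory-bound model."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# A dense 70B model at 4 bits on 546 GB/s of unified memory.
print(round(decode_ceiling_tok_s(546, 70e9, 4), 1))
```

Real throughput sits below this ceiling, but the estimate explains why a mixture-of-experts model with few active parameters decodes much faster than a dense model of the same total size.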
The runtime story got materially better in 2026. Ollama switched to an MLX backend for Apple Silicon, and its own March 2026 benchmarks on a Qwen 3.5 35B-A3B model showed prefill rising from 1,154 to 1,810 tokens/sec and decode from 58 to 112 tokens/sec after the move, with the M5’s GPU Neural Accelerators targeted natively for the prefill stage. Those are vendor-reported figures; treat them as directional.
M4 Max ceiling
Up to 128GB of unified memory at 546 GB/s of bandwidth lets an M4 Max hold and feed models that would exceed a comparably priced discrete GPU's VRAM.
Qwen 3.5 35B-A3B, M5
After Ollama moved to an MLX backend (March 2026), vendor-reported prefill on this model rose from 1,154 to 1,810 tokens/sec and decode from 58 to 112.
Qwen 3.5 9B, MLX 4-bit
A 16GB MacBook Air runs a 9B model at 4-bit around 25 to 35 tokens/sec in community testing — capable local inference with no desktop GPU.
You will also see claims that MLX uses roughly 10% less memory than GGUF on a Mac and runs 15 to 30% faster at an equivalent quantization level, and that 4-bit MLX retains around 97% of the full-precision MMLU score. Those figures come from secondary community benchmarks rather than independently verified measurements, so read them as directional rather than precise. The reliable takeaway is simpler: on an M-series Mac, MLX is the native path and now the default Apple-Silicon backend in Ollama — and for edge and on-device work it is a first-class option. For the broader case for running models locally, see our guide to on-device AI agent inference.
07 — bitsandbytes
bitsandbytes: the only training-safe format here.
bitsandbytes is the odd one out, and deliberately so. It performs on-the-fly quantization, NF4 (4-bit Normal Float) and LLM.int8() (8-bit), without requiring any pre-quantized model file. Weights are decompressed only when needed during the forward pass. NF4 comes from the QLoRA paper, which describes it as information-theoretically optimal for normally distributed weights: its quantile bins are matched to the normal distribution’s CDF, giving it an edge over plain FP4. NF4 cuts weight memory roughly 4x versus FP16, and nested (“double”) quantization saves roughly another 0.4 bits per parameter on top.
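The quantile idea behind NF4 can be sketched with the standard library alone. This is a simplified illustration, not the bitsandbytes code table: it places 16 levels at the midpoints of equal-probability bins of a standard normal and rescales them to [-1, 1], whereas the real NF4 table is built asymmetrically so that zero is represented exactly. The function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Levels at equal-probability quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()
    levels = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in levels)
    return [v / top for v in levels]


def quantize_block(block, codebook):
    """Absmax-normalize a block and map each value to its nearest level index."""
    absmax = max(abs(v) for v in block) or 1.0
    idx = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - v / absmax))
        for v in block
    ]
    return idx, absmax


def dequantize_block(idx, absmax, codebook):
    return [codebook[k] * absmax for k in idx]


codebook = normal_quantile_codebook()
idx, absmax = quantize_block([-0.5, 0.02, 0.1, 0.5], codebook)
print(idx, dequantize_block(idx, absmax, codebook))
```

The levels crowd together near zero, where normally distributed weights are dense, and spread out in the tails. That is the property that gives a quantile-based 4-bit type its edge over evenly spaced levels.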
What sets bitsandbytes apart is that it is the only format in this set designed for training, not just inference. A QLoRA fine-tune freezes the 4-bit NF4 base model and trains small LoRA adapter matrices on top — which is what makes fine-tuning a 13B model on a single 16GB NVIDIA T4 possible. The 8-bit path, LLM.int8(), uses mixed-precision decomposition: outlier hidden-state values above roughly 6 are kept in FP16 while the rest go to INT8, which avoids the catastrophic quality loss naive INT8 causes on transformers. Minimum hardware is an NVIDIA Turing GPU (RTX 20-series or T4) or newer.
NF4 + QLoRA
4-bit Normal Float with bins matched to the normal distribution. Freeze the NF4 base, train LoRA adapters on top — that is QLoRA, and it puts a 13B fine-tune within reach of a single 16GB NVIDIA T4.
LLM.int8()
Mixed-precision decomposition keeps the handful of large-magnitude outlier activations in FP16 and the rest in INT8, sidestepping the catastrophic loss naive INT8 causes on transformers.
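The effect of mixed-precision decomposition shows up even in a four-element dot product. The sketch below is a toy, not the bitsandbytes kernel: it uses a single hidden-state vector and weight column with illustrative values, applies the threshold of 6 mentioned above, and compares naive absmax INT8 against the decomposed version.

```python
def int8_roundtrip(values):
    """Absmax INT8 quantization followed by dequantization."""
    scale = (max(abs(v) for v in values) or 1.0) / 127
    return [round(v / scale) * scale for v in values]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def naive_int8_dot(x, w):
    return dot(int8_roundtrip(x), int8_roundtrip(w))


def mixed_int8_dot(x, w, threshold=6.0):
    """Keep outlier dimensions in floating point, quantize the rest to INT8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    high = sum(x[i] * w[i] for i in outliers)          # full-precision path
    low = 0.0
    if regular:
        low = dot(
            int8_roundtrip([x[i] for i in regular]),
            int8_roundtrip([w[i] for i in regular]),
        )
    return high + low


x = [0.5, -1.2, 40.0, 0.8]    # one large-magnitude outlier activation
w = [0.3, -0.7, 0.05, 0.9]
exact = dot(x, w)
naive_err = abs(naive_int8_dot(x, w) - exact)
mixed_err = abs(mixed_int8_dot(x, w) - exact)
print(f"naive error: {naive_err:.4f}, mixed error: {mixed_err:.4f}")
```

The single outlier stretches the INT8 scale so far that the small activations lose most of their resolution. Pulling it out restores a fine-grained scale for everything else.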
08 — Decision Matrix
Six formats, mapped to your hardware.
Here is the whole field in one view. Read it by hardware first: the format that fits your silicon eliminates most of the menu before quality or speed even enters the conversation. The grouping reinforces the distinction from Section 01 — self-contained file formats, algorithms stored as safetensors, and the on-the-fly outlier.
| Format | Hardware | Primary runtimes | Bit depths | Training-safe | Quality / speed note |
|---|---|---|---|---|---|
| Self-contained file formats | |||||
| GGUF (k-quants) | CPU · NVIDIA · AMD · Apple Silicon (+ partial offload) | llama.cpp, Ollama, LM Studio | 2 to 8 bit | No | K-quants match GPTQ-4 quality; moderate on CPU, fast on GPU; the portable default. |
| EXL2 | NVIDIA GPU only | ExLlamaV2 + TabbyAPI | 2 to 8 bpw, any fractional avg | No | Top single-GPU tokens/sec; dial bitrate to your exact VRAM. |
| MLX | Apple Silicon only | mlx-lm, Ollama (macOS) | 3 to 8 bit | Partial (mlx-lm trains) | Native unified-memory speed on M-series; the Mac default. |
| Quantization algorithms (Hugging Face safetensors) | |||||
| GPTQ | NVIDIA GPU only | vLLM (Marlin/Machete), HF TGI, GPTQModel | 2, 3, 4 and 8 bit | No | Baseline 4-bit; fast kernels, but superseded by AWQ for new models. |
| AWQ | NVIDIA GPU primary (+ edge/mobile GPU) | vLLM, HF TGI, TensorRT-LLM, LMDeploy | 4-bit INT4 | No | Better than GPTQ on reasoning; about 3x FP16; production GPU default. |
| On-the-fly (no static file) | |||||
| bitsandbytes NF4 | NVIDIA (Turing / T4 or newer) | Hugging Face Transformers | 4-bit NF4 / 8-bit INT8 | Yes (QLoRA) | Slower per forward pass; built for fine-tuning, not peak inference. |
Reach for GGUF
Laptops, CPU-only boxes, or a mix of CPU and GPU: GGUF is the only format that runs across all of them and can split a model between CPU and GPU. Start at Q4_K_M, step up to Q5_0 or Q8_0 if you have the memory.
AWQ first, EXL2 for speed
On NVIDIA, default to AWQ for new models — better reasoning at 4-bit and first-class support in vLLM, TGI and TensorRT-LLM. If you are single-GPU and chasing maximum tokens/sec, EXL2 lets you dial bitrate to your card.
MLX, or GGUF for portability
On an M-series Mac, MLX is the native path and now the default Apple-Silicon backend in Ollama. Choose GGUF instead only when you want the same file to also run on a CPU box or a PC GPU.
bitsandbytes NF4
Training adapters, not just running inference: bitsandbytes NF4 freezes a 4-bit base and trains LoRA layers, fitting a 13B fine-tune on a 16GB T4. Convert to GGUF or AWQ afterwards for fast serving.
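The hardware-first reading of the matrix can be written down as a small function. This is a sketch of this article's recommendations only: `pick_format` and its parameter names are hypothetical, and the hardware labels are plain strings chosen for the example.

```python
def pick_format(hardware, fine_tuning=False, max_single_gpu_speed=False,
                need_portable_file=False):
    """Map a hardware target to the format recommended in the matrix above.

    `hardware` is one of "nvidia", "apple", "amd", "cpu" or "mixed";
    anything unrecognized falls through to the portable default.
    """
    hardware = hardware.lower()
    if hardware == "nvidia":
        if fine_tuning:
            return "bitsandbytes NF4"      # QLoRA path
        if max_single_gpu_speed:
            return "EXL2"                  # single card, tuned bitrate
        return "AWQ"                       # production serving default
    if hardware == "apple":
        return "GGUF" if need_portable_file else "MLX"
    return "GGUF"                          # CPU, AMD, mixed or unknown


for target in ("nvidia", "apple", "cpu"):
    print(target, "->", pick_format(target))
```

The order of the checks mirrors the matrix: hardware eliminates most options first, and only then do workload details such as fine-tuning or portability break the tie.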
For most teams the honest answer is more than one format: GGUF for the laptop prototype, AWQ on the inference server, bitsandbytes for the fine-tune. Standing up that pipeline — choosing formats, benchmarking on your own prompts, and wiring it into a runtime that fits your hardware — is exactly the kind of work our AI digital transformation engagements start with. If the deployment target is the open web or an internal app, our web development team ships the serving layer around it. For the bigger build-versus-buy picture, our guide to self-hosting open-weight models sets the context for when a local format is worth the operational weight.
09 — Conclusion
The format is a hardware decision, not a bit-width one.
Pick the format your silicon was built for, then tune the bits.
The menu collapses the moment you sort by hardware. GGUF for portability and CPU or mixed machines; AWQ, EXL2 or GPTQ on NVIDIA; MLX on Apple Silicon; bitsandbytes when you are fine-tuning rather than just serving. Get that first cut right and the comparison stops being overwhelming — most of the formats simply do not apply to the box in front of you.
The clearest trend underneath all of it is that hardware-native formats are pulling ahead. AWQ displaced GPTQ on NVIDIA because its activation-aware approach exploits where the error actually lives; MLX wins on Apple because it is built around unified memory; GGUF k-quants endure because they spread a fixed bit budget intelligently across layers. The formats that win are the ones that understand the memory architecture, not just the nominal bit-width.
Expect the question to keep simplifying — runtimes are converging on a couple of sensible defaults per platform — but the file-format versus algorithm distinction will persist, and the orthogonal lever of how many bits to keep, covered in our companion piece on 4-bit, 8-bit and FP8 tradeoffs, is the next decision after this one. Whichever format your hardware points you to, run a quick perplexity and latency test on your own model before you commit. The Q5_0-beats-FP16 result is a useful reminder that the only benchmark that matters is the one on your workload.
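For that final check, perplexity is straightforward to compute once your runtime exposes per-token log-probabilities. The sketch below assumes natural-log probabilities for each token of a held-out text; how you obtain them depends on the runtime, and both helper names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline, quantized):
    """Fractional perplexity increase of a quantized model over its baseline."""
    return (quantized - baseline) / baseline


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))

# Using the F16 and Q4_K_S perplexities from the GGUF table above.
print(f"{relative_increase(7.32, 7.62):.1%}")
```

Run the same text through the full-precision and quantized models and compare the two numbers alongside measured latency. A few percent of perplexity is often an acceptable price, but that is a judgment only your own workload can settle.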