Development · Industry Guide · 14 min read · Published June 28, 2026

Five formats, one fit question · GGUF everywhere, AWQ on GPU, MLX on Mac

GGUF, AWQ, GPTQ, EXL2 & MLX: which fits your stack

GGUF, GPTQ, AWQ, EXL2 and MLX are not interchangeable — they target different hardware and trade speed against accuracy in different ways. Most comparisons blur a basic distinction: GGUF is a self-contained file format, while GPTQ and AWQ are quantization algorithms whose weights live inside Hugging Face safetensors. This guide maps each one to its runtime, hardware fit, and best use case.

Digital Applied Team · Senior engineers
Sources: 8 primary docs
  • AWQ INT4 error cut: 74% (penalty 4.57 to 1.17 vs naive INT4)
  • GGUF hardware reach: 4 targets (CPU · NVIDIA · AMD · Apple)
  • AWQ downloads: 6M+ on Hugging Face
  • EXL2 bitrate range: 2 to 8 bpw, any fractional average

Choosing between GGUF, AWQ, GPTQ, EXL2 and MLX is the most common LLM quantization decision developers get wrong — and picking the wrong format can, by some estimates, cost a meaningful share of inference speed before you have tuned a single parameter. The formats are not interchangeable: each targets different hardware and trades accuracy against throughput in its own way. This guide is the map.

The confusion starts with a category error. GGUF is a self-contained file format — weights, tokenizer, and metadata bundled into one portable artifact. GPTQ and AWQ are quantization algorithms whose output is stored as ordinary Hugging Face safetensors. EXL2 and MLX are formats again, each welded to a single runtime. Treat them as one menu and you will pair a format with hardware it was never built for.

Below: a plain-English breakdown of all six options (the five formats in the title plus bitsandbytes), a comparison table mapping each to its runtime and hardware fit, the counterintuitive benchmark data from a January 2026 evaluation, and a decision tree for CPU, NVIDIA, and Apple Silicon. If you want the orthogonal question of how many bits to keep, our companion piece on 4-bit vs 8-bit vs FP8 tradeoffs covers the precision dimension this post deliberately sets aside.

Key takeaways
  1. Format and algorithm are not the same thing. GGUF is a self-contained container; GPTQ and AWQ are algorithms whose weights are stored as Hugging Face safetensors; EXL2 and MLX are formats bound to one runtime each. Pick on that axis before you pick a bit-width.
  2. GGUF is the universal default. It runs on CPU, NVIDIA, AMD and Apple Silicon, can split layers across CPU and GPU, and its k-quants match GPTQ-4 quality at the same average bit-width. The safe choice when hardware is mixed or unknown.
  3. On NVIDIA, AWQ has overtaken GPTQ. AWQ's activation-aware scaling cut the INT4 quantization penalty from 4.57 to 1.17 perplexity (about 74%) and it outperforms GPTQ on reasoning at the same 4 bits. New models now ship AWQ or GGUF first.
  4. EXL2 dials bitrate to your VRAM; MLX owns the Mac. EXL2 targets any fractional average between 2 and 8 bits per weight on a single NVIDIA GPU; MLX exploits Apple unified memory and is now the default Apple-Silicon backend in Ollama.
  5. bitsandbytes is the only training-safe format here. Its 4-bit NF4 path powers QLoRA fine-tuning by freezing a compressed base and training small adapters. Every other format in this comparison is built for inference; the partial exception is MLX, where mlx-lm can also train.

01 · Format vs Algorithm
The category error almost every comparison makes.

Before comparing speeds, fix the vocabulary. People say “quantization format” to mean three different kinds of thing, and that conflation is exactly how someone ends up with an EXL2 file they cannot load on a Mac, or expects a GGUF to run inside vLLM. There are three buckets.

A file format is a complete, self-contained artifact: the file you download is the model. GGUF, EXL2 and MLX are file formats, each tied to a specific runtime. A quantization algorithm is a procedure, not a file type — GPTQ and AWQ produce ordinary Hugging Face safetensors shards plus a config, which is why an “AWQ model” looks like any other Hugging Face repository. And on-the-fly quantization, the bitsandbytes approach, ships no pre-quantized file at all: weights are compressed at load time and decompressed during the forward pass.

  • File formats (container, self-contained): GGUF · EXL2 · MLX. The downloaded file is the artifact: weights, tokenizer and quant parameters bundled together, each tied to a specific runtime (llama.cpp, ExLlamaV2, mlx-lm).
  • Algorithms (method, stored as safetensors): GPTQ · AWQ. A quantization procedure, not a file type. The output is standard Hugging Face safetensors shards plus a config, so the repo looks like any other HF model.
  • On-the-fly (runtime, quantized at load): bitsandbytes NF4 / INT8. No pre-quantized file at all. Weights are compressed at load time and decompressed per forward pass, which is what makes QLoRA fine-tuning possible.

Why this matters first
Match the artifact to the runtime before you argue about bit-width. A self-contained format can only run in the engine it was built for; an algorithm’s safetensors output needs a runtime that implements that algorithm’s kernels. Get this layer wrong and no amount of careful 4-bit tuning will load the model on your machine.
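The three buckets can be told apart from what a download contains. The sketch below is a minimal illustration, not a complete detector: the function name `classify_artifact` is hypothetical, and it assumes that a Hugging Face `config.json` records the method under `quantization_config.quant_method`. EXL2 and MLX repositories are deliberately left out because their markers are not covered here.

```python
import json

def classify_artifact(filenames, config_json=None):
    """Return (bucket, note) for a model download.

    filenames:   names of the files in the download.
    config_json: optional text of config.json, if the download has one.
    The detection rules are assumptions made for this sketch.
    """
    names = [name.lower() for name in filenames]

    # A .gguf file is the whole model: weights, tokenizer and metadata.
    if any(name.endswith(".gguf") for name in names):
        return ("file format", "GGUF: load with llama.cpp, Ollama or LM Studio")

    config = json.loads(config_json) if config_json else {}
    # Assumed convention: the algorithm is named in quantization_config.
    method = str(config.get("quantization_config", {}).get("quant_method", "")).lower()
    if method in ("gptq", "awq"):
        label = method.upper()
        return ("algorithm", f"{label} safetensors: needs a runtime with {label} kernels")

    if any(name.endswith(".safetensors") for name in names):
        return ("unknown", "safetensors with no recognized quant_method")
    return ("unknown", "no recognized weight files")


print(classify_artifact(["llama-3.1-8b-q4_k_m.gguf"]))
print(classify_artifact(
    ["model.safetensors", "config.json"],
    '{"quantization_config": {"quant_method": "awq"}}',
))
```

The useful part is the order of the checks: the container question is answered before anything about bit-width is even read.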

02 · GGUF
GGUF: the format that runs everywhere.

GGUF is sometimes expanded as “Georgi Gerganov Universal Format,” after the creator of llama.cpp and ggml, although expansions vary between sources. It replaced the older GGML format in August 2023, a distinction worth keeping straight, because GGML is the predecessor and GGUF is the current standard. A single GGUF file packages weights, tokenizer, metadata and quantization parameters into one portable artifact.

Its defining feature is hardware-agnosticism. GGUF runs on the CPU via llama.cpp, on NVIDIA GPUs through CUDA, on AMD through ROCm, and on Apple Silicon through Metal — and it can split a model’s layers between CPU and GPU, a trick called partial offload that lets you run a model slightly too large for your VRAM. No other format in this comparison spans that many targets.
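Partial offload comes down to simple arithmetic: how many layers fit in the VRAM you have. The sketch below is a rough estimate under stated assumptions. It treats every layer as the same size, and the `reserve_gb` default is a made-up allowance for KV cache, activations and runtime overhead. llama.cpp takes the resulting count through its `--n-gpu-layers` option.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Estimate how many transformer layers can be placed on the GPU.

    Assumes equal-sized layers; reserve_gb is a hypothetical allowance
    for the KV cache, activations and runtime overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A 13 GB file with 40 layers on an 8 GB card (illustrative numbers).
gpu_layers = layers_that_fit(model_gb=13.0, n_layers=40, vram_gb=8.0)
print(f"offload {gpu_layers} of 40 layers to the GPU, keep the rest on CPU")
```

Treat the result as a starting point and lower it if the runtime reports out-of-memory errors, since real layers and context lengths vary.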

On the quality side, llama.cpp’s K-quants use a super-block structure: weights are grouped into super-blocks, and each smaller block inside a super-block gets its own scaling factor. The S/M/L suffix indicates the mix rather than the block size: the S, M and L variants keep progressively more of the sensitive tensors at a higher-precision quant type, so S is the smallest and slightly least accurate and L is the largest. K-quants consistently beat the older Q4_0 and Q5_0 formats at the same average bit-width. The table below uses independent perplexity and benchmark measurements from a January 2026 arXiv evaluation of llama.cpp quantization on Llama-3.1-8B-Instruct.
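To see why per-block scaling factors help, here is a toy sketch in plain Python. It is not llama.cpp's bit layout: it quantizes floats to 16 levels with a scale and minimum per block and compares one scale for all 256 weights against eight blocks of 32. The `bits_per_weight` figures at the end are an assumption based on llama.cpp's published Q4_K description (256-weight super-blocks, eight blocks, 6-bit scales and minimums, two 16-bit super-block values), not numbers from the evaluation cited above.

```python
import random

def quantize_block(block, levels=16):
    """Quantize one block to integer codes with its own scale and minimum."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / (levels - 1) or 1.0
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    return [c * scale + lo for c in codes]

def roundtrip_error(weights, block_size):
    """Mean absolute error after quantizing in blocks of block_size."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored = dequantize_block(*quantize_block(block))
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)

def bits_per_weight(n_weights, code_bits, n_blocks, block_meta_bits, super_meta_bits):
    """Average storage cost once scale metadata is included."""
    total_bits = n_weights * code_bits + n_blocks * block_meta_bits + super_meta_bits
    return total_bits / n_weights

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]
weights[7] = 0.5  # a single outlier stretches any scale that has to cover it

err_one_scale = roundtrip_error(weights, block_size=256)
err_blocks = roundtrip_error(weights, block_size=32)
q4k_bpw = bits_per_weight(256, 4, 8, 6 + 6, 2 * 16)

print(f"one scale: {err_one_scale:.5f}  per-block scales: {err_blocks:.5f}")
print(f"assumed Q4_K cost: {q4k_bpw} bits per weight")
```

With block-level scales the outlier only degrades the block it lives in, which is the intuition behind paying a fraction of a bit per weight for the extra metadata.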

GGUF quantization tiers tested on Llama-3.1-8B-Instruct, showing average bits, size reduction, WikiText-2 perplexity, aggregate benchmark score and GSM8K math score per the January 2026 arXiv evaluation.

| Tier | GGUF type | Avg bits | Size cut | Perplexity | Aggregate | GSM8K | Use when |
|---|---|---|---|---|---|---|---|
| Reference & near-lossless | F16 (baseline) | 16 | 0% | 7.32 | 69.47 | 77.63 | Reference only; too large to ship |
| Reference & near-lossless | Q8_0 | 8 | 46.9% | 7.33 | 69.41 | 77.48 | Near-lossless quality needed |
| The sweet spot | Q5_0 | 5 | 65.2% | 7.43 | 69.92 | 79.08 | Best quality-per-byte on CPU |
| The sweet spot | Q4_K_S | 4 | 70.8% | 7.62 | 69.17 | 77.33 | Tight VRAM on GPU |
| Aggressive compression | Q3_K_S | 3 | 77.2% | 8.96 | 65.49 | 68.31 | Maximum compression, clear cost |

Two things about that table are worth a second look. First, the Q4_K_M tier most people actually run is not listed: it sits just above Q4_K_S in bit allocation and is the popular default. It is meaningfully better than the legacy Q4_0 at the same nominal width because the most sensitive tensors get a slightly higher bit budget. Second, and more surprising, the 5-bit Q5_0 tier edged out FP16 on this run: 69.92 versus 69.47 on the aggregate score, and 79.08 versus 77.63 on GSM8K math. The paper attributes this to a mild regularization effect from quantization noise. Treat it as a specific finding for this model and this evaluation, not a universal rule that 5-bit always beats full precision.
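The raw perplexity column is easier to read as a change relative to the F16 baseline. The short sketch below does that arithmetic on the values copied from the table; the dictionary and helper names are just for illustration.

```python
# (perplexity, GSM8K) per tier, copied from the table above.
results = {
    "F16": (7.32, 77.63),
    "Q8_0": (7.33, 77.48),
    "Q5_0": (7.43, 79.08),
    "Q4_K_S": (7.62, 77.33),
    "Q3_K_S": (8.96, 68.31),
}

def relative_change(value, baseline):
    """Percentage change of value against baseline."""
    return (value - baseline) / baseline * 100

base_ppl, base_gsm8k = results["F16"]
ppl_increase = {name: relative_change(ppl, base_ppl) for name, (ppl, _) in results.items()}
gsm8k_delta = {name: score - base_gsm8k for name, (_, score) in results.items()}

for name in results:
    print(f"{name:7s} perplexity {ppl_increase[name]:+6.2f}%   GSM8K {gsm8k_delta[name]:+6.2f} pts")
```

Seen this way, the 4-bit and higher tiers stay within a few percent of baseline perplexity, while Q3_K_S is the first step where the cost becomes obvious.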

GGUF is the format you load in LM Studio for local model management, in Ollama, or directly through llama.cpp. If your priority is getting a capable model running on whatever machine is in front of you — including a laptop with no discrete GPU — GGUF is almost always the right first move.

03 · GPTQ
GPTQ: the GPU standard that 2025 outgrew.

GPTQ was the method that made single-GPU inference of giant models possible. It can quantize a 175-billion-parameter model in roughly 4 GPU hours to 3 or 4 bits with negligible accuracy degradation, and it more than doubled compression versus the one-shot methods that came before it. For two years that was the GPU quantization story: one prolific Hugging Face contributor, TheBloke, uploaded more than 2,000 GPTQ-quantized models and effectively set the de-facto standard for 2023 and 2024.

The runtime story is solid: vLLM serves GPTQ through its Marlin and Machete custom kernels, optimized for Ampere (A100 and up) and Hopper (H100 and up) NVIDIA GPUs. GPTQ also supports extreme compression down to 2-bit and even ternary quantization — but here the paper claims and the production reality diverge. Community experience shows quality degrades meaningfully below 4 bits, so treat sub-4-bit GPTQ as experimental rather than something you ship.

The honest 2026 framing is that GPTQ has been largely superseded by AWQ for new model releases. It still has the largest legacy library and remains a perfectly reasonable choice for existing GPTQ checkpoints, but when a new model drops, it now lands in AWQ or GGUF first.

  • Quant time (3 to 4 bit): 175B in roughly 4 GPU-hours. GPTQ was the first method to quantize a 175-billion-parameter model to 3 to 4 bits with negligible accuracy loss, enabling single-GPU inference of models that previously needed a cluster.
  • Ecosystem (legacy standard): 2k+ GPTQ uploads from TheBloke. One prolific Hugging Face contributor published over 2,000 GPTQ-quantized models, which is why GPTQ became the de-facto GPU standard across 2023 and 2024.
  • Floor (NVIDIA only): 2-bit, extreme compression at a real cost. GPTQ supports 2-bit and even ternary quantization, but community testing shows quality drops sharply below 4 bits, so keep production GPTQ at 4-bit.

04 · AWQ
AWQ: protect the one percent that matters.

AWQ — Activation-Aware Weight Quantization, from MIT’s Han Lab — is built on a single sharp observation: weights are not equally important. Identifying and protecting just the top 1% of salient weights, selected by activation magnitude rather than by the weight values themselves, sharply reduces quantization error. AWQ does this with per-channel scaling informed by activation patterns, and it needs no backpropagation and no weight reconstruction to do it.
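The scaling trick is easy to demonstrate in isolation. The toy sketch below multiplies a salient channel by a factor before round-to-nearest INT4 quantization and divides it back out afterwards, which in a real layer is done by dividing the matching activation. It assumes symmetric INT4 codes and a group maximum that the scaled channel never exceeds; the scale of 2.0 is chosen by hand, whereas AWQ itself derives per-channel scales from calibration activations.

```python
import random

QMAX = 7  # symmetric INT4 codes in [-7, 7] (an assumption of this sketch)

def quantize_dequantize(w, step):
    """Round to the nearest grid point, clamped to the INT4 range."""
    code = max(-QMAX, min(QMAX, round(w / step)))
    return code * step

def mean_error(weights, group_max, scale=1.0):
    """Mean absolute weight error when a channel is multiplied by scale
    before quantization and divided by it afterwards.

    group_max is the largest magnitude in the quantization group. It sets the
    grid step and is assumed to stay the same when the channel is scaled.
    """
    step = group_max / QMAX
    errors = [abs(quantize_dequantize(w * scale, step) / scale - w) for w in weights]
    return sum(errors) / len(errors)

random.seed(0)
# Salient-channel weights, kept small enough that scaling by 2 stays below group_max.
salient = [random.uniform(-0.45, 0.45) for _ in range(2000)]

plain_error = mean_error(salient, group_max=1.0)
scaled_error = mean_error(salient, group_max=1.0, scale=2.0)
print(f"plain INT4: {plain_error:.4f}   scaled by 2: {scaled_error:.4f}")
```

The grid step is unchanged, so the rounding error on the scaled channel is the same size in absolute terms and roughly half as large once the scale is divided out. That is the sense in which the important weights are protected without storing them at higher precision.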

That idea won the Best Paper Award at MLSys 2024, and the numbers back it up. On INT4 with group size 128, AWQ dropped the perplexity penalty from 4.57 to 1.17 — about a 74% reduction in degradation — and at the same 4-bit depth it consistently outperforms GPTQ on reasoning and instruction-following benchmarks. It also runs fast: a reported 3x-plus speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. AWQ models on Hugging Face have passed 6 million downloads.

The AWQ insight
Per MIT’s Han Lab, protecting only 1% of salient weights — chosen by activation magnitude, not weight value — can greatly reduce quantization error, with no backpropagation or weight reconstruction required. On INT4 with group size 128 that cut the perplexity penalty from 4.57 to 1.17, roughly a 74% reduction.

Practically, AWQ has become the default for production GPU inference. It is natively supported across the major NVIDIA stacks — vLLM, Hugging Face TGI, NVIDIA TensorRT-LLM and LMDeploy — and is primarily designed for NVIDIA and edge or mobile GPU acceleration. One caveat on the marketing line: AWQ’s “lossless” claim refers specifically to vision-language models like VILA. For most text LLMs, AWQ at 4-bit is near-lossless, which is excellent, but not literally lossless — benchmark it on your own prompts before treating quality as a solved problem.

The GPTQ-to-AWQ handoff is the clearest trend in the GPU quantization space. It is not that GPTQ stopped working; it is that AWQ’s activation-aware approach delivers better quality at the same bit budget with broader first-class runtime support, so the ecosystem moved its default. When a format wins on both quality and tooling at once, the rest of the field follows within a release cycle or two — which is exactly what happened across 2025 and 2026.

05 · EXL2
EXL2: dial the bitrate to fit your VRAM.

EXL2, the format used by ExLlamaV2, has a genuinely different idea at its core: instead of committing to a whole number of bits, it lets you target any average bitrate between 2 and 8 bits per weight by mixing quantization levels within a single model. Want 4.65 bits per weight to exactly fill a 24GB card? You can ask for it. The algorithm measures quantization error per matrix and allocates more bits to the layers that are most sensitive to it.

“EXL2 supports 2, 3, 4, 5, 6 and 8-bit quantization, and allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.” (turboderp, ExLlamaV2 author)
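The bitrate arithmetic behind that flexibility fits in a few lines. The helpers below are a sketch under simple assumptions: 1 GB is taken as 10^9 bytes, only the weights are counted, and any `reserve_gb` for KV cache or overhead is up to you. The two-level mix in `high_bit_fraction` is a hypothetical simplification; ExLlamaV2 allocates bits per matrix from measured error rather than from a fixed split.

```python
def weights_gb(n_params, bpw):
    """Size of the weights alone at a given average bits per weight."""
    return n_params * bpw / 8 / 1e9

def max_bpw(n_params, vram_gb, reserve_gb=0.0):
    """Highest average bitrate whose weights fit in the remaining VRAM."""
    return (vram_gb - reserve_gb) * 1e9 * 8 / n_params

def high_bit_fraction(target_bpw, low_bits, high_bits):
    """Share of weights that must use high_bits to hit target_bpw on average."""
    return (target_bpw - low_bits) / (high_bits - low_bits)

size_70b = weights_gb(70e9, 2.55)
ceiling_70b = max_bpw(70e9, vram_gb=24)
share_5bit = high_bit_fraction(4.65, low_bits=4, high_bits=5)

print(f"70B at 2.55 bpw: {size_70b:.1f} GB of weights")
print(f"70B on a 24 GB card: at most {ceiling_70b:.2f} bpw before any overhead")
print(f"4.65 bpw from a 4/5-bit mix: {share_5bit:.0%} of weights at 5 bits")
```

In practice you would pick a target a little below the ceiling, because context length and runtime buffers also need room on the card.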

The trade-off is reach: EXL2 is NVIDIA-only. It does not support AMD, Apple Silicon, or CPU inference. The ExLlamaV2 runtime can also load 4-bit GPTQ models, so existing checkpoints keep working, and the recommended production stack pairs ExLlamaV2 with TabbyAPI as a local API server. Where it earns its keep is raw single-GPU throughput: users report 2 to 5x speed increases versus standard loaders for quantized models on single-GPU setups.

  • 1.1B @ 3.0 bpw, TinyLlama on an RTX 4090: 770 t/s (single-batch). At the small end, EXL2 pushes token generation into the high hundreds. A vendor-stated single-batch figure from the ExLlamaV2 maintainer.
  • 7B @ 3.0 bpw, Llama 2 7B on an RTX 4090: 257 t/s (vendor-stated). A 7B model still clears 250 tokens/sec on one consumer card, competitive with much larger server stacks for interactive use.
  • 70B @ 2.55 bpw, Llama 2 70B on one 24GB GPU: 38 t/s. Fractional bitrate is what lets a 70B model fit a single 24GB GPU at all; at 2.55 bpw it still generates around 38 tokens/sec.

Read those numbers as a best case, not a guarantee. They are vendor-stated benchmark figures from the ExLlamaV2 maintainer, reflecting optimal single-batch generation; expect lower throughput under multi-user or batched-inference load. EXL2 is the power-user format — the right pick when you have a single NVIDIA card, want the most tokens per second you can get, and are happy to tune bitrate to the megabyte.

06 · MLX
MLX: built for the unified memory on every Mac.

MLX is Apple’s open-source machine-learning framework for Apple Silicon, released in December 2023. It is built around the thing that makes a Mac different: the unified memory architecture, where CPU, GPU and Neural Engine share one physical pool of memory. MLX exploits that with zero-copy tensor operations and lazy evaluation that fuses operations before execution. There is no copying tensors back and forth across a PCIe bus, because there is no separate VRAM to copy to.

That architecture is why Apple Silicon punches above its weight on large models. An M4 Max delivers up to 128GB of unified memory at 546 GB/s of bandwidth. That bandwidth is in discrete-GPU territory rather than data-center territory, but the memory pool is far larger than the VRAM of a comparably priced discrete GPU, so it can hold and feed models that such a card could not. The capacity, not just the speed, is the point: you can load a model a discrete card simply could not fit.
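Memory bandwidth also puts a rough ceiling on decode speed. The sketch below uses a common rule of thumb that is not from the sources behind this guide: it assumes a dense model at batch size 1, where generating each token reads every weight once, and it ignores KV cache traffic and compute time. The 9-billion-parameter example is a hypothetical dense model.

```python
def weights_gb(n_params, bits_per_weight):
    """Size of the weights alone, taking 1 GB as 10^9 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

def decode_ceiling_tokens_per_sec(bandwidth_gb_s, n_params, bits_per_weight):
    """Upper bound on tokens/sec if every token streams all weights once."""
    return bandwidth_gb_s / weights_gb(n_params, bits_per_weight)

def fits_in_memory(n_params, bits_per_weight, memory_gb):
    """True if the weights alone fit; runtime overhead is not counted."""
    return weights_gb(n_params, bits_per_weight) <= memory_gb

ceiling_9b_4bit = decode_ceiling_tokens_per_sec(546, 9e9, 4)
print(f"9B dense at 4-bit on 546 GB/s: at most ~{ceiling_9b_4bit:.0f} tokens/sec")
print("70B at 8-bit fits in 128 GB:", fits_in_memory(70e9, 8, 128))
```

Real throughput lands below this bound, and mixture-of-experts models read only their active parameters per token, so use it to sanity-check benchmark claims rather than to predict them.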

The runtime story got materially better in 2026. Ollama switched to an MLX backend for Apple Silicon, and its own March 2026 benchmarks on a Qwen 3.5 35B-A3B model showed prefill rising from 1,154 to 1,810 tokens/sec and decode from 58 to 112 after the move — the M5’s GPU Neural Accelerators are targeted natively for the prefill stage. Those are vendor-reported figures; treat them as directional.

  • Unified memory, M4 Max ceiling: 128GB at 546 GB/s. Up to 128GB of unified memory at 546 GB/s of bandwidth lets an M4 Max hold and feed models that would exceed a comparably priced discrete GPU's VRAM.
  • Ollama prefill, Qwen 3.5 35B-A3B on M5: 1,810 t/s (vendor-reported). After Ollama moved to an MLX backend (March 2026), vendor-reported prefill on this model rose from 1,154 to 1,810 tokens/sec and decode from 58 to 112.
  • MacBook Air, real world, Qwen 3.5 9B at MLX 4-bit: 25 to 35 t/s (community benchmark). A 16GB MacBook Air runs a 9B model at 4-bit around 25 to 35 tokens/sec in community testing: capable local inference with no desktop GPU.

You will also see claims that MLX uses roughly 10% less memory than GGUF on a Mac and runs 15 to 30% faster at an equivalent quantization level, and that 4-bit MLX retains around 97% of the full-precision MMLU score. Those figures come from secondary community benchmarks rather than independently verified measurements, so read them as directional rather than precise. The reliable takeaway is simpler: on an M-series Mac, MLX is the native path and now the default Apple-Silicon backend in Ollama — and for edge and on-device work it is a first-class option. For the broader case for running models locally, see our guide to on-device AI agent inference.

07 · bitsandbytes
bitsandbytes: the only training-safe format here.

bitsandbytes is the odd one out, and deliberately so. It performs on-the-fly quantization — NF4 (4-bit Normal Float) and LLM.int8() (8-bit) — without requiring any pre-quantized model file. Weights are decompressed only when needed during the forward pass. NF4 is theoretically optimal for normally distributed weights: it comes from the QLoRA paper, and its quantile bins are matched to the normal distribution’s CDF, giving it an edge over plain FP4. NF4 cuts memory 4x versus FP16, and nested (“double”) quantization saves another 0.4 bits per parameter on top.
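The idea of matching bins to the normal distribution can be sketched with the standard library alone. The code below is a simplified construction, not the exact NF4 table that bitsandbytes ships: it takes the midpoints of 16 equal-probability slices of a standard normal, rescales them to the range -1 to 1, and quantizes a block by absmax normalization followed by a nearest-level lookup. All function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """2**bits levels at the midpoints of equal-probability normal bins,
    rescaled so the outermost levels sit at -1 and 1."""
    n = 2 ** bits
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_to_levels(block, levels):
    """Normalize by the block's absmax, then store the nearest level index."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / absmax))
        for w in block
    ]
    return codes, absmax

def dequantize(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]

LEVELS = normal_quantile_levels()
inner_gap = LEVELS[8] - LEVELS[7]
outer_gap = LEVELS[15] - LEVELS[14]
print(f"gap near zero: {inner_gap:.3f}   gap at the edge: {outer_gap:.3f}")

codes, absmax = quantize_to_levels([0.5, -1.0, 0.0, 0.25], LEVELS)
print(codes, [round(v, 3) for v in dequantize(codes, absmax, LEVELS)])
```

The levels crowd together near zero, where most normally distributed weights live, and spread out toward the tails. That is the property the prose above credits for NF4's edge over plain FP4.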

What sets bitsandbytes apart is that it is the only format in this set designed for training, not just inference. A QLoRA fine-tune freezes the 4-bit NF4 base model and trains small LoRA adapter matrices on top — which is what makes fine-tuning a 13B model on a single 16GB NVIDIA T4 possible. The 8-bit path, LLM.int8(), uses mixed-precision decomposition: outlier hidden-state values above roughly 6 are kept in FP16 while the rest go to INT8, which avoids the catastrophic quality loss naive INT8 causes on transformers. Minimum hardware is an NVIDIA Turing GPU (RTX 20-series or T4) or newer.
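The mixed-precision decomposition can be illustrated on a single vector. This is a toy sketch, not the bitsandbytes kernel: it handles one input vector instead of a batch, uses plain Python floats for the high-precision path, and applies absmax INT8 quantization to the rest. The threshold of 6.0 follows the figure quoted above; everything else, including the function names, is an assumption of the example.

```python
import random

def absmax_quantize(values):
    """Symmetric INT8 quantization: integer codes in [-127, 127] plus a scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def int8_mixed_matvec(x, W, threshold=6.0):
    """Compute y = x @ W with outlier dimensions of x kept in floating point
    and all other dimensions multiplied in INT8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    n_out = len(W[0])
    y = [0.0] * n_out
    for i in outliers:  # high-precision path
        for j in range(n_out):
            y[j] += x[i] * W[i][j]
    if regular:  # INT8 path: integer accumulate, then rescale
        xq, x_scale = absmax_quantize([x[i] for i in regular])
        for j in range(n_out):
            wq, w_scale = absmax_quantize([W[i][j] for i in regular])
            y[j] += sum(a * b for a, b in zip(xq, wq)) * x_scale * w_scale
    return y

random.seed(0)
hidden, n_out = 64, 4
x = [random.gauss(0, 1) for _ in range(hidden)]
x[5] = 40.0  # one large-magnitude outlier feature
W = [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(hidden)]
exact = [sum(x[i] * W[i][j] for i in range(hidden)) for j in range(n_out)]

def total_error(y):
    return sum(abs(a - b) for a, b in zip(y, exact))

mixed_error = total_error(int8_mixed_matvec(x, W))
naive_error = total_error(int8_mixed_matvec(x, W, threshold=float("inf")))
print(f"naive INT8: {naive_error:.4f}   mixed decomposition: {mixed_error:.4f}")
```

With the outlier left in the INT8 path, it stretches the scale and the ordinary values lose most of their resolution. Pulling it out restores a fine grid for everything else, at the cost of a second, small high-precision multiply.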

  • 4-bit, NF4 + QLoRA (training-safe; double-quant saves 0.4 bit/param): 4-bit Normal Float with bins matched to the normal distribution. Freeze the NF4 base, train LoRA adapters on top. That is QLoRA, and it puts a 13B fine-tune within reach of a single 16GB NVIDIA T4.
  • 8-bit, LLM.int8() (FP16 outliers + INT8 rest; Turing (T4) or newer): Mixed-precision decomposition keeps the handful of large-magnitude outlier activations in FP16 and the rest in INT8, sidestepping the catastrophic loss naive INT8 causes on transformers.

What 'training-safe' actually means
Per the Hugging Face docs, 8- and 4-bit training is only supported for training extra parameters, meaning the LoRA adapters. The quantized base weights stay frozen. You are not fully fine-tuning a model in NF4; you are training small adapter matrices against a compressed backbone. For pure inference, a dedicated format (GGUF, AWQ or EXL2) will serve faster.

08 · Decision Matrix
Six formats, mapped to your hardware.

Here is the whole field in one view. Read it by hardware first: the format that fits your silicon eliminates most of the menu before quality or speed even enters the conversation. The grouping reinforces the distinction from Section 01 — self-contained file formats, algorithms stored as safetensors, and the on-the-fly outlier.

Master comparison of GGUF, EXL2, MLX, GPTQ, AWQ and bitsandbytes NF4 by hardware target, primary runtimes, supported bit depths, whether the format is training-safe, and a quality or speed note.

| Category | Format | Hardware | Primary runtimes | Bit depths | Training-safe | Quality / speed note |
|---|---|---|---|---|---|---|
| Self-contained file formats | GGUF (k-quants) | CPU · NVIDIA · AMD · Apple Silicon (+ partial offload) | llama.cpp, Ollama, LM Studio | 2 to 8 bit | No | K-quants match GPTQ-4 quality; moderate on CPU, fast on GPU; the portable default. |
| Self-contained file formats | EXL2 | NVIDIA GPU only | ExLlamaV2 + TabbyAPI | 2 to 8 bpw, any fractional avg | No | Top single-GPU tokens/sec; dial bitrate to your exact VRAM. |
| Self-contained file formats | MLX | Apple Silicon only | mlx-lm, Ollama (macOS) | 3 to 8 bit | Partial (mlx-lm trains) | Native unified-memory speed on M-series; the Mac default. |
| Quantization algorithms (Hugging Face safetensors) | GPTQ | NVIDIA GPU only | vLLM (Marlin/Machete), HF TGI, GPTQModel | 2 to 4 bit | No | Baseline 4-bit; fast kernels, but superseded by AWQ for new models. |
| Quantization algorithms (Hugging Face safetensors) | AWQ | NVIDIA GPU primary (+ edge/mobile GPU) | vLLM, HF TGI, TensorRT-LLM, LMDeploy | 4-bit INT4 | No | Better than GPTQ on reasoning; about 3x FP16; production GPU default. |
| On-the-fly (no static file) | bitsandbytes NF4 | NVIDIA (Turing / T4 or newer) | Hugging Face Transformers | 4-bit NF4 / 8-bit INT8 | Yes (QLoRA) | Slower per forward pass; built for fine-tuning, not peak inference. |

  • CPU or mixed / unknown hardware: pick GGUF (Q4_K_M). Laptops, CPU-only boxes, or a mix of CPU and GPU: GGUF is the only format that runs across all of them and can split a model between CPU and GPU. Start at Q4_K_M, step up to Q5_0 or Q8_0 if you have the memory.
  • NVIDIA production server: pick AWQ first, or EXL2 for speed. On NVIDIA, default to AWQ for new models: better reasoning at 4-bit and first-class support in vLLM, TGI and TensorRT-LLM. If you are single-GPU and chasing maximum tokens/sec, EXL2 lets you dial bitrate to your card.
  • Apple Silicon: pick MLX, or GGUF for portability. On an M-series Mac, MLX is the native path and now the default Apple-Silicon backend in Ollama. Choose GGUF instead only when you want the same file to also run on a CPU box or a PC GPU.
  • QLoRA fine-tuning: pick bitsandbytes NF4. Training adapters, not just running inference: bitsandbytes NF4 freezes a 4-bit base and trains LoRA layers, fitting a 13B fine-tune on a 16GB T4. Convert to GGUF or AWQ afterwards for fast serving.
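The same decision tree can be written as a small function. This is only a restatement of the recommendations above in code; the function name, argument names and hardware labels are hypothetical, and anything outside the four cases falls back to GGUF.

```python
def pick_format(hardware, goal="inference", single_gpu_max_speed=False,
                need_portability=False):
    """Map a deployment situation to the format this guide recommends.

    hardware: "cpu", "mixed", "nvidia" or "apple" (labels invented for this sketch).
    goal:     "inference" or "finetune".
    """
    hardware = hardware.lower()
    if goal == "finetune":
        if hardware != "nvidia":
            # The bitsandbytes path described above needs a Turing-class NVIDIA GPU or newer.
            raise ValueError("QLoRA with bitsandbytes NF4 needs an NVIDIA GPU")
        return "bitsandbytes NF4"
    if hardware == "nvidia":
        return "EXL2" if single_gpu_max_speed else "AWQ"
    if hardware == "apple":
        return "GGUF (Q4_K_M)" if need_portability else "MLX"
    # CPU-only, mixed or unknown hardware: the portable default.
    return "GGUF (Q4_K_M)"


for case in [("cpu", {}), ("nvidia", {}), ("nvidia", {"single_gpu_max_speed": True}),
             ("apple", {}), ("nvidia", {"goal": "finetune"})]:
    print(case, "->", pick_format(case[0], **case[1]))
```

Hardware is checked before anything else, which mirrors the advice to sort by silicon first and argue about bits afterwards.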

For most teams the honest answer is more than one format: GGUF for the laptop prototype, AWQ on the inference server, bitsandbytes for the fine-tune. Standing up that pipeline — choosing formats, benchmarking on your own prompts, and wiring it into a runtime that fits your hardware — is exactly the kind of work our AI digital transformation engagements start with. If the deployment target is the open web or an internal app, our web development team ships the serving layer around it. For the bigger build-versus-buy picture, our guide to self-hosting open-weight models sets the context for when a local format is worth the operational weight.

09 · Conclusion
The format is a hardware decision, not a bit-width one.

Where quant formats are heading

Pick the format your silicon was built for, then tune the bits.

The menu collapses the moment you sort by hardware. GGUF for portability and CPU or mixed machines; AWQ, EXL2 or GPTQ on NVIDIA; MLX on Apple Silicon; bitsandbytes when you are fine-tuning rather than just serving. Get that first cut right and the comparison stops being overwhelming — most of the formats simply do not apply to the box in front of you.

The clearest trend underneath all of it is that hardware-native formats are pulling ahead. AWQ displaced GPTQ on NVIDIA because its activation-aware approach exploits where the error actually lives; MLX wins on Apple because it is built around unified memory; GGUF k-quants endure because they spread a fixed bit budget intelligently across layers. The formats that win are the ones that understand the memory architecture, not just the nominal bit-width.

Expect the question to keep simplifying — runtimes are converging on a couple of sensible defaults per platform — but the file-format versus algorithm distinction will persist, and the orthogonal lever of how many bits to keep, covered in our companion piece on 4-bit, 8-bit and FP8 tradeoffs, is the next decision after this one. Whichever format your hardware points you to, run a quick perplexity and latency test on your own model before you commit. The Q5_0-beats-FP16 result is a useful reminder that the only benchmark that matters is the one on your workload.

FAQ · LLM quant formats

The questions we get every week.

GGUF, GPTQ and AWQ are not the same kind of thing, which is the most common point of confusion. GGUF is a self-contained file format: a single portable file that bundles weights, tokenizer, metadata and quantization parameters, and runs through llama.cpp, Ollama or LM Studio across CPU, NVIDIA, AMD and Apple Silicon. GPTQ and AWQ are quantization algorithms, not file formats: their output is stored as ordinary Hugging Face safetensors shards plus a config, and they run on NVIDIA GPUs through engines like vLLM and TGI. So GGUF is a container you load directly, while GPTQ and AWQ are methods whose results you serve through a GPU runtime. Picking on that axis first (container versus algorithm, and which hardware each targets) matters more than the nominal bit-width.