Quantization is the single biggest lever on open-weight inference economics. The right level — FP16 → FP8 → INT8 → AWQ-4 → GPTQ-4 → INT3 — depends on the workload, not the model. The cross-model regression curves below settle most of the production debate.
Across six 2026 open-weight 70B-class models — Llama 4 70B, Qwen 3 72B, DeepSeek V4-Flash, Mistral Large 2, Command-R+, and Yi 2 — FP8 lands within 0.4 points of FP16 on MMLU-Pro and HumanEval+, INT8 within 0.7 points, AWQ-4 within 1.6 points, and GPTQ-4 within 1.9 points. VRAM drops 50%, 50%, 75%, and 75% respectively. Batch-1 decode throughput lift on H100 ranges from 1.4× (FP8) to 3.1× (AWQ-4).
This post publishes the cross-model regression curves, break-even tables for each quant level, and a decision matrix by workload class.
- 01. FP8 is the right default in 2026 — within noise on quality, half the memory. Across all six tested models, FP8 (E4M3) shows a 0.3-0.5 point MMLU-Pro regression vs FP16. That is within run-to-run benchmark noise on most workloads. Throughput lifts 1.4-1.7×. The combination makes FP8 the default for any new open-weight deployment on H100 or newer hardware.
- 02. AWQ-4 is the right call when VRAM is the binding constraint. AWQ-4 (Activation-aware Weight Quantization at 4 bits) drops VRAM 75% and lifts throughput 3.1× at the cost of a 1.4-1.8 point quality regression. The trade is right when fitting a larger model on a smaller cluster matters more than peak quality — typically mid-volume self-hosters running on 4×H100 instead of 8×H100.
- 03. GPTQ-4 is incrementally more aggressive than AWQ-4, with more variance across models. GPTQ-4's quality regression varies 1.5-2.4 points across the test set, vs AWQ-4's tighter 1.3-1.8 range. AWQ wins on consistency; GPTQ sometimes wins on per-model peak. Default to AWQ; benchmark GPTQ when exploring the last few percent of accuracy retention.
- 04. INT3 and below are research-grade, not production — quality regression is workload-dependent and severe. Below 4-bit (INT3, INT2, ternary), regression jumps to 4-12 points on most benchmarks and is highly task-dependent. Some chat workloads tolerate it; long-context retrieval, code generation, and math reasoning collapse. The production answer: stay at 4-bit minimum until the architecture catches up.
- 05. Always run an A/B eval on your specific workload before committing. Aggregate benchmark scores hide workload variance. A model that drops 1 point on MMLU-Pro might drop 4 points on a domain-specific task that matters more. Build a 100-200 prompt eval set against your production usage pattern and score it across quant levels before pushing the change.
01 — The Ladder
The quantization ladder, top to bottom.
Modern LLM quantization is a ladder of precision levels, each with characteristic memory savings, throughput gains, and quality regressions. The ladder runs from FP16 (the training precision baseline) down through FP8, INT8, 4-bit (AWQ, GPTQ, GGUF Q4), 3-bit, and below. Each step has a different sweet spot.
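The memory side of the ladder is simple arithmetic: bits per weight times parameter count. A back-of-envelope sketch (the 1.2× overhead factor for quant scales, activations, and runtime buffers is an assumption, not a measurement):

```python
def model_vram_gb(params_b: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough weights-dominated VRAM for a dense model, in GB.

    params_b: parameter count in billions.
    overhead: assumed fudge factor for quant scales, activations,
    and runtime buffers -- not a measured number.
    """
    return params_b * bits_per_weight / 8 * overhead

for name, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("AWQ-4", 4), ("INT3", 3)]:
    print(f"{name:6s} ~{model_vram_gb(70, bits):.0f} GB")
```

At FP16 a 70B model wants roughly 168 GB for weights alone (hence 8×H100); at FP8 roughly half that; at 4-bit the weights alone fit on a single 80 GB card, with the rest of the budget going to KV cache.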
FP16 — the baseline
16 bits per weight · 100% memory
Reference precision. No quality regression. Full memory cost. Use as the benchmark target; rarely deployed in production at frontier scale because the throughput cost vs FP8 is unjustified on Hopper or newer hardware.
Reference

FP8 — the modern production baseline
8 bits per weight · 50% memory
E4M3 (preferred) or E5M2 floating-point format. Native hardware support on H100, H200, B100/B200. Quality regression: 0.3-0.5 points across the test set. Throughput lift: 1.4-1.7×. The default 2026 production setting.
Default 2026

INT8 — the broad-compatibility option
8 bits per weight · 50% memory · integer
Per-channel scale, full integer arithmetic. Slightly more regression than FP8 (0.5-0.9 points) due to narrower dynamic range. The right call on hardware without native FP8 (A100, AMD MI300X on mid-2024 ROCm). Otherwise prefer FP8.
Pre-Hopper hardware

AWQ-4 / GPTQ-4 — the 4-bit options
4 bits per weight · 75% memory savings
Activation-aware (AWQ) or Hessian-based (GPTQ) 4-bit weight-only quantization. Quality regression: 1.4-2.4 points. VRAM drops 75%. Throughput lifts 2.6-3.1×. Right when fitting on a smaller cluster matters or when batch size is the binding throughput constraint.
Smaller cluster

02 — Quality
Quality regression across six models.
The quality data below is averaged across MMLU-Pro, HumanEval+, GPQA Diamond, and MATH-500 for the six 2026 open-weight 70B-class models tested. Numbers are absolute point regression vs the FP16 baseline of each respective model.
Quality regression vs FP16 · cross-model average
Source: vLLM 0.7 + llm-compressor benchmarks · 6-model average · Apr 2026

Two reads. First: the precipice between 4-bit and 3-bit is real — at AWQ-4 you lose 1.6 points on average; at INT3 you lose 6 points. Sub-4-bit quantization is research-grade for most production workloads. Second: FP8 is essentially free quality-wise (0.4 points is well within run-to-run noise on these benchmarks). For any deployment on H100 or newer hardware, leaving FP8 disabled is leaving 30-50% throughput on the table for no quality reason.
03 — Throughput & VRAM
The economic gain from quantization.
Quality regression is the cost; throughput and VRAM are the gains. The two run on inverse curves — the more aggressive the quantization, the bigger the throughput lift and VRAM savings, up to a point where memory bandwidth saturates and additional compression stops helping. On H100 in 2026, that ceiling sits around 3-4× lift for batch-1 decode.
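That ceiling falls out of a one-line roofline model: at batch 1 every generated token must stream the full weight set from HBM, so decode speed is capped at bandwidth divided by weight bytes. A sketch, using the nominal H100 SXM bandwidth (real kernels land below this ceiling):

```python
PARAMS = 70e9            # 70B-class dense model
HBM_BW = 3.35e12         # H100 SXM nominal HBM3 bandwidth, bytes/s (approx)

def decode_ceiling_tps(bits_per_weight: float) -> float:
    """Upper bound on batch-1 decode tokens/s: bandwidth / bytes streamed per token."""
    return HBM_BW / (PARAMS * bits_per_weight / 8)

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ-4", 4)]:
    print(f"{name:6s} ceiling ~{decode_ceiling_tps(bits):.0f} tok/s")
```

The ideal lifts are 2× (FP8) and 4× (4-bit); the measured 1.4-1.7× and 3.1× sit below the roofline because dequantization overhead and non-weight traffic (KV cache, activations) eat into the gain.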
FP8 · throughput lift · batch-1 decode
On H100 with vLLM 0.7 and Llama 4 70B, FP8 lifts batch-1 decode tokens-per-second by 1.4-1.7× vs FP16. Memory drops 50%. Compounds with FP8 KV cache for additional gains. The right default for new deployments on Hopper or newer hardware.
Default 2026

AWQ-4 · throughput lift · batch-1 decode
Largest measured throughput lift (3.1×). Memory drops 75%. Right when fitting on a smaller cluster (4×H100 instead of 8×) or when batch size is the binding constraint. The quality cost (1.6 points) is the trade.
Best TPS gain

INT8 · throughput lift · batch-1 decode
Slightly behind FP8 on H100 because FP8 has native hardware support and INT8 doesn't. On AMD MI300X and A100 (no native FP8), INT8 wins. Always benchmark on the actual target hardware before locking in.
Cross-platform

"FP8 is free money on H100. AWQ-4 is the conscious trade. INT3 is a research project." — Internal serving-stack notes, May 2026
04 — Methods
GPTQ vs AWQ vs FP8 vs INT8 — mechanically.
The four major quantization methods used in production are not interchangeable. Each has a different objective function and a different post-training recipe, and the right choice depends on both the workload and the available calibration data.
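To make the AWQ entry below concrete, here is a toy illustration (not the real algorithm) of activation-aware scaling: boost the weight rows of salient input channels before 4-bit rounding, fold the inverse scale into the activations, and the dominant channel's quantization error shrinks. The matrix values and the 4× scale factor are contrived for illustration.

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 4-bit fake-quant: round to levels -7..7, dequantize."""
    scale = np.abs(w).max() / 7
    return np.round(w / scale) * scale

# Rows are input channels. Row 0 is "salient": its activations are huge,
# but its weights are small, so plain 4-bit rounds them coarsely.
W = np.array([
    [0.13, -0.07,  0.11, -0.04],
    [0.70, -0.55,  0.63, -0.49],
    [-0.61, 0.66, -0.52,  0.58],
    [0.44, -0.70,  0.37, -0.66],
])
x = np.array([[100.0, 1.0, 1.0, 1.0]])   # channel 0 dominates activations
y_ref = x @ W

# Plain weight-only 4-bit.
err_plain = np.abs(x @ quantize_int4(W) - y_ref).max()

# AWQ-style: scale the salient row up before quantizing, fold the
# inverse scale back so the layer computes the same function.
s = np.array([4.0, 1.0, 1.0, 1.0])[:, None]
err_awq = np.abs(x @ (quantize_int4(W * s) / s) - y_ref).max()

print(f"plain 4-bit error: {err_plain:.2f}, AWQ-style error: {err_awq:.2f}")
```

The real method searches for per-channel scales from calibration activations and applies them per group, but the mechanism is the same: spend quantization resolution where the activations say it matters.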
FP8 — floating-point, native hardware support
8-bit floating-point with a 4-bit exponent and 3-bit mantissa. Native on H100/H200/B100. Per-tensor or per-channel scale. Best dynamic range of the 8-bit options. The right default in 2026 for any Hopper or newer deployment.
Default 8-bit option

INT8 — integer with explicit per-channel scale
8-bit integer with a per-output-channel scale factor. Symmetric or asymmetric. Cross-platform compatibility (works on A100, MI300X). Slightly worse dynamic range than FP8. Use when the target hardware lacks native FP8.
Cross-platform 8-bit

AWQ — weight-only 4-bit, activation-informed scaling
Scales weights so that the channels with the largest activation magnitudes are preserved at higher fidelity. More consistent quality across models than GPTQ. Slightly slower offline calibration but faster runtime. The default 2026 4-bit choice.
Default 4-bit option

GPTQ — layer-by-layer least-squares quantization
Hessian-based post-training quantization with second-order error correction. Sometimes wins on per-model peak accuracy; more variance across the test set. Worth benchmarking against AWQ on your specific workload; AWQ is the better default.
Per-model peak hunting

05 — Decision
Picking a level by workload class.
Workload class governs which trade is acceptable. Code generation and math reasoning are the most quantization-sensitive (boundary cases matter); chat and summarization are the least sensitive (averaged-out semantics). Long-context retrieval is in between but skews toward sensitive on multi-needle tasks.
Chat / summarization · short-context
Low quantization sensitivity. FP8 is free money; AWQ-4 is acceptable on memory-constrained deployments. The eval tax to verify is small (a 50-prompt smoke test usually settles it). Default to FP8 with AWQ-4 as a fallback for VRAM-bound clusters.
FP8 default · AWQ-4 backup

Code generation · function correctness
High sensitivity. HumanEval+ and SWE-Bench scores drop more at 4-bit than averaged benchmarks suggest. FP8 is fine; AWQ-4 needs a real eval before committing. INT3 is catastrophic. Default to FP8; reserve AWQ-4 for non-critical paths.
FP8 only for critical paths

Long-context retrieval · multi-needle
Sensitive at the tail of the context window. FP8 is acceptable; AWQ-4 risks dropping NIAH-2 multi-needle scores by 4-7 points. KV cache quantization is a bigger lever here than weight quantization — handle KV first.
FP8 weights · FP8 KV first
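The "handle KV first" advice is easy to motivate with arithmetic. For a grouped-query-attention model, the cache for a single long sequence is tens of gigabytes at FP16. The 80-layer, 8-KV-head, 128-head-dim shape below is a hypothetical 70B-class config, not any specific model's:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int) -> float:
    # K and V tensors per layer: 2 * kv_heads * head_dim elements per token
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

cfg = dict(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"FP16 KV: {kv_cache_gb(**cfg, bytes_per_elem=2):.1f} GB")
print(f"FP8  KV: {kv_cache_gb(**cfg, bytes_per_elem=1):.1f} GB")
```

At 128K context the FP16 cache for one sequence rivals the 4-bit weights themselves, which is why quantizing weights but not KV leaves much of the long-context gain on the table.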
Math reasoning · agentic tool-use
Highest sensitivity. Even FP8 can show 0.5-1 point regression on MATH-500 and AIME-class benchmarks. Use the FP16 baseline for the highest-stakes reasoning paths; FP8 is acceptable elsewhere; avoid 4-bit. Pair with a full-precision reasoning eval if production-critical.
FP16 critical · FP8 baseline

06 — Pitfalls
Pitfalls and eval gates worth running.
- Skipping the workload eval. The single most common mistake. Aggregate benchmarks tell you the average regression; your workload may be on the long tail. Always build a 100-200 prompt eval set against production patterns before committing.
- Quantizing weights without quantizing KV cache. Weight quantization helps memory and weight-load throughput; KV cache quantization helps long-context throughput. They compound. A deployment with FP8 weights and FP16 KV cache is leaving half the gain on the table.
- Mixing AWQ and GPTQ across deployment paths without consistency. Each method has different per-model quirks. Pick one and stay on it for a deployment so eval results are comparable. Switching mid-stream wastes a week of regression triage.
- Trusting calibration data that doesn't match the workload. AWQ and GPTQ both rely on a calibration dataset during post-training quantization. If the calibration set is generic (Wikipedia) and your workload is domain-specific (legal, medical, code), the activation distributions will not match and quality will be worse than benchmarks suggest. Use workload-representative calibration data.
- Ignoring per-task variance. Some tasks are robust to 4-bit; some collapse. Long-form generation typically robust; short-answer factual recall typically sensitive. Build a task-class breakdown into the eval set, not just an aggregate score.
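A minimal sketch of the per-task eval gate the pitfalls above keep pointing at: score every prompt under both the baseline and the quantized candidate, then report the delta per task class rather than a single aggregate. The task labels and scores here are placeholder toy data; you would plug in your own scorer and 100-200 production prompts.

```python
from collections import defaultdict

def regression_by_task(baseline, candidate):
    """baseline/candidate: lists of (task_class, score) pairs, one per prompt.
    Returns per-task-class mean score delta (candidate - baseline)."""
    def means(rows):
        buckets = defaultdict(list)
        for task, score in rows:
            buckets[task].append(score)
        return {t: sum(v) / len(v) for t, v in buckets.items()}
    b, c = means(baseline), means(candidate)
    return {t: round(c[t] - b[t], 3) for t in b}

# Toy numbers: the aggregate hides a collapse on one task class.
fp16 = [("chat", 0.9)] * 8 + [("code", 0.8)] * 2
awq4 = [("chat", 0.9)] * 8 + [("code", 0.4)] * 2
print(regression_by_task(fp16, awq4))   # chat flat, code down 0.4
```

The aggregate drop here is under a point per ten prompts, while the code class loses half its score: exactly the long-tail failure an aggregate-only gate would wave through.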
07 — Conclusion
Quantization is free money if you measure.
FP8 is the new FP16. AWQ-4 is the right trade when VRAM binds. Below 4-bit is research.
By April 2026 the quantization landscape is settled at the top of the ladder: FP8 is the right default for any deployment on Hopper or newer hardware, INT8 the cross-platform fallback, and AWQ-4 the right call when VRAM is the binding constraint. The quality regressions are well-characterized, the throughput gains are measurable, and the implementation is one config flag in vLLM, SGLang, or TensorRT-LLM.
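What "one config flag" looks like in practice, sketched against vLLM's CLI (flag names as of vLLM ~0.7 docs; the model name is a placeholder, and flags can change between versions):

```shell
# FP8 weights plus FP8 KV cache on Hopper-class hardware
vllm serve <your-70b-model> \
    --quantization fp8 \
    --kv-cache-dtype fp8
```

SGLang and TensorRT-LLM expose equivalent switches under their own names; check the current docs for your serving stack before rolling out.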
The discipline is in the eval. Aggregate benchmark numbers tell you what to default to; workload-specific evaluation tells you whether the default holds. Build the eval first; flip the flag second; keep the eval running on every deployment to catch regressions before they hit production. That is the entire playbook.
The next generation of compression — sub-4-bit production-grade, mixed-precision per-layer, quantization-aware training built into the model from initialization — is in active research and will land in 2027-2028. The teams that have built the muscle to evaluate quantization rigorously will adopt those advances quarter-by-quarter; the teams that haven't will keep paying full FP16 rack rate for the next two years.