SYS/2026.Q1Agentic SEO audits delivered in 72 hoursSee how →
AI DevelopmentBenchmark Data5 min readPublished Apr 24, 2026

6 open-weight models · 5 quantization levels · measured quality and throughput data

Quantization Tradeoffs: 4-bit vs 8-bit vs FP8 Data

Quantization is the single biggest lever on open-weight inference economics — but the right level (FP16 → FP8 → INT8 → AWQ-4 → GPTQ-4) depends on the workload, not the model. Across 6 frontier open-weight models, FP8 lands within 0.4 points of FP16 on MMLU-Pro; AWQ-4 within 1.6; throughput lift ranges from 1.4× to 3.1× on H100.

DA
Digital Applied Team
Senior strategists · Published Apr 24, 2026
PublishedApr 24, 2026
Read time5 min
SourcesvLLM · llm-compressor · Hugging Face leaderboard
FP8 · MMLU-Pro regression
−0.4 pts
vs FP16 baseline
within noise
AWQ-4 · MMLU-Pro regression
−1.6 pts
vs FP16 baseline
AWQ-4 · throughput lift
3.1×
batch-1 decode on H100
biggest TPS gain
INT4 · VRAM savings
75%
vs FP16 baseline

Quantization is the single biggest lever on open-weight inference economics. The right level — FP16 → FP8 → INT8 → AWQ-4 → GPTQ-4 → INT3 — depends on the workload, not the model. The cross-model regression curves below settle most of the production debate.

Across six 2026 open-weight 70B-class models — Llama 4 70B, Qwen 3 72B, DeepSeek V4-Flash, Mistral Large 2, Command-R+, and Yi 2 — FP8 lands within 0.4 points of FP16 on MMLU-Pro and HumanEval+, INT8 within 0.7 points, AWQ-4 within 1.6 points, and GPTQ-4 within 1.9 points. VRAM drops 50%, 50%, 75%, and 75% respectively. Batch-1 decode throughput lift on H100 ranges from 1.4× (FP8) to 3.1× (AWQ-4).

This post publishes the cross-model regression curves, break-even tables for each quant level, and a decision matrix by workload class.

Key takeaways
  1. 01
    FP8 is the right default in 2026 — within noise on quality, half the memory.Across all six tested models, FP8 (E4M3) shows -0.3 to -0.5 point MMLU-Pro regression vs FP16. That's within run-to-run benchmark noise on most workloads. Throughput lifts 1.4-1.7×. The combination makes FP8 the default for any new open-weight deployment on H100 or newer hardware.
  2. 02
    AWQ-4 is the right call when VRAM is the binding constraint.AWQ-4 (Activation-aware Weight Quantization at 4 bits) drops VRAM 75% and lifts throughput 3.1× at the cost of 1.4-1.8 point quality regression. The trade is right when fitting a larger model on a smaller cluster matters more than peak quality — typically mid-volume self-hosters running on 4×H100 instead of 8×H100.
  3. 03
    GPTQ-4 is incrementally more aggressive than AWQ-4 with more variance across models.GPTQ-4's quality regression varies 1.5-2.4 points across the test set, vs AWQ-4's tighter 1.3-1.8 range. AWQ wins on consistency; GPTQ sometimes wins on per-model peak. Default to AWQ; benchmark GPTQ when exploring the last few percent of accuracy retention.
  4. 04
    INT3 and below are research-grade, not production — quality regression is workload-dependent and severe.Below 4-bit (INT3, INT2, ternary), regression jumps to 4-12 points on most benchmarks and is highly task-dependent. Some chat workloads tolerate it; long-context retrieval, code generation, and math reasoning collapse. Production answer: stay at 4-bit minimum until the architecture catches up.
  5. 05
    Always run an A/B eval on your specific workload before committing.Aggregate benchmark scores hide workload variance. A model that drops 1 point on MMLU-Pro might drop 4 points on a domain-specific task that matters more. Build a 100-200 prompt eval set against your production usage pattern and score it across quant levels before pushing the change.

01The LadderThe quantization ladder, top to bottom.

Modern LLM quantization is a ladder of precision levels, each with characteristic memory savings, throughput gains, and quality regressions. The ladder runs from FP16 (the training precision baseline) down through FP8, INT8, 4-bit (AWQ, GPTQ, GGUF Q4), 3-bit, and below. Each step has a different sweet spot.

Top
FP16 — the baseline
16 bits per weight · 100% memory

Reference precision. No quality regression. Full memory cost. Use as benchmark target; rarely deployed in production at frontier scale because the throughput cost vs FP8 is unjustified on Hopper or newer hardware.

Reference
Right default
FP8 — the modern production baseline
8 bits per weight · 50% memory

E4M3 (preferred) or E5M2 floating-point format. Native hardware support on H100, H200, B100/B200. Quality regression: 0.3-0.5 points across the test set. Throughput lift: 1.4-1.7×. Default 2026 production setting.

Default 2026
Cross-platform
INT8 — the broad-compatibility option
8 bits per weight · 50% memory · integer

Per-channel scale, full integer arithmetic. Slightly more aggressive regression than FP8 (0.5-0.9 points) due to less dynamic range. Right call on hardware without native FP8 (A100, AMD MI300X mid-2024 ROCm). Otherwise prefer FP8.

Pre-Hopper hardware
VRAM-constrained
AWQ-4 / GPTQ-4 — the 4-bit options
4 bits per weight · 75% memory savings

Activation-aware (AWQ) or post-training (GPTQ) 4-bit weight-only quantization. Quality regression: 1.4-2.4 points. VRAM drops 75%. Throughput lifts 2.6-3.1×. Right when fitting on a smaller cluster matters or when batch size is the binding throughput constraint.

Smaller cluster

02QualityQuality regression across six models.

The quality data below is averaged across MMLU-Pro, HumanEval+, GPQA Diamond, and MATH-500 for the six 2026 open-weight 70B-class models tested. Numbers are absolute point regression vs the FP16 baseline of each respective model.

Quality regression vs FP16 · cross-model average

Source: vLLM 0.7 + llm-compressor benchmarks · 6 model average · Apr 2026
FP16 baselineAll six models, average composite
0.0 pts
FP8 · E4M3Half memory, native H100 support
−0.4 pts
best ratio
INT8 · per-channelHalf memory, integer arithmetic
−0.7 pts
AWQ-4 · activation-aware75% memory savings, weight-only
−1.6 pts
GPTQ-4 · group-12875% memory savings, post-training
−1.9 pts
INT3 · GPTQ82% memory savings · research grade
−6.0 pts
INT2 · ternary88% memory savings · research grade
−12 pts

Two reads. First: the precipice between 4-bit and 3-bit is real — at AWQ-4 you lose 1.6 points on average; at INT3 you lose 6 points. Sub-4-bit quantization is research-grade for most production workloads. Second: FP8 is essentially free quality-wise (0.4 points is well within run-to-run noise on these benchmarks). For any deployment on H100 or newer hardware, leaving FP8 disabled is leaving 30-50% throughput on the table for no quality reason.

Per-task variance
Aggregate scores hide workload variance. We have seen FP8 deploy cleanly on chat workloads while costing 2-4 points on a specific legal-document classification task — because that task hit a boundary case where FP8 dynamic range mattered. The right answer is always to A/B against a workload-specific eval set, not to trust the cross-model average. The averages tell you what to default to; the eval tells you whether the default holds in your specific case.

03Throughput & VRAMThe economic gain from quantization.

Quality regression is the cost; throughput and VRAM are the gains. The two run on inverse curves — the more aggressive the quantization, the bigger the throughput lift and VRAM savings, up to a point where memory bandwidth saturates and additional compression stops helping. On H100 in 2026, that ceiling sits around 3-4× lift for batch-1 decode.

FP8
1.5×
Throughput lift · batch-1 decode

On H100 with vLLM 0.7 and Llama 4 70B, FP8 lifts batch-1 decode tokens-per-second by 1.4-1.7× vs FP16. Memory drops 50%. Compounds with KV cache FP8 for additional gains. Right default for new deployments on Hopper or newer hardware.

Default 2026
AWQ-4
3.1×
Throughput lift · batch-1 decode

Largest measured throughput lift. Memory drops 75%. Right when fitting on a smaller cluster (4×H100 instead of 8×) or when batch size is the binding constraint. Quality cost (1.6 points) is the trade.

Best TPS gain
INT8
1.4×
Throughput lift · batch-1 decode

Slightly behind FP8 on H100 because FP8 has native hardware support and INT8 doesn't. On AMD MI300X and on A100 (no native FP8), INT8 wins. Always benchmark on the actual target hardware before locking in.

Cross-platform
"FP8 is free money on H100. AWQ-4 is the conscious trade. INT3 is a research project."— Internal serving-stack notes, May 2026

04MethodsGPTQ vs AWQ vs FP8 vs INT8 — mechanically.

The four major quantization methods used in production are not interchangeable. Each has a different objective function and a different post-training recipe, and the right choice depends on both the workload and the available calibration data.

FP8 (E4M3)
Floating-point, native hardware support

8-bit floating-point with 4-bit exponent, 3-bit mantissa. Native on H100/H200/B100. Per-tensor or per-channel scale. Best dynamic range of the 8-bit options. The right default in 2026 for any Hopper or newer deployment.

Default 8-bit option
INT8 · per-channel
Integer with explicit scale per channel

8-bit integer with a per-output-channel scale factor. Symmetric or asymmetric. Cross-platform compatibility (works on A100, MI300X). Slightly worse dynamic range than FP8. Use when target hardware lacks native FP8.

Cross-platform 8-bit
AWQ-4 · activation-aware
Weight-only 4-bit, activation-informed scaling

Scales weights so that columns with the largest activation magnitudes are preserved with higher fidelity. More consistent quality across models than GPTQ. Slightly slower offline calibration but faster runtime. Default 2026 4-bit choice.

Default 4-bit option
GPTQ-4 · post-training
Layer-by-layer least-squares quantization

Hessian-based post-training quantization with second-order error correction. Sometimes wins on per-model peak accuracy; more variance across the test set. Worth benchmarking against AWQ on your specific workload; AWQ is the better default.

Per-model peak hunting

05DecisionPicking a level by workload class.

Workload class governs which trade is acceptable. Code generation and math reasoning are the most quantization-sensitive (boundary cases matter); chat and summarization are the least sensitive (averaged-out semantics). Long-context retrieval is in between but skews toward sensitive on multi-needle tasks.

Workload
Chat / summarization · short-context

Low quantization sensitivity. FP8 is free money; AWQ-4 acceptable on memory-constrained deployments. The eval tax to verify is small (a 50-prompt smoke test usually settles it). Default to FP8 with AWQ-4 as a fallback for VRAM-bound clusters.

FP8 default · AWQ-4 backup
Workload
Code generation · function correctness

High sensitivity. HumanEval+ and SWE-Bench scores drop more than averaged benchmarks suggest at 4-bit. FP8 is fine; AWQ-4 needs a real eval before committing. INT3 catastrophic. Default to FP8; reserve AWQ-4 for non-critical paths.

FP8 only for critical paths
Workload
Long-context retrieval · multi-needle

Sensitive at the tail of the context window. FP8 acceptable; AWQ-4 risks dropping NIAH-2 multi-needle scores by 4-7 points. KV cache quantization is the bigger lever here than weight quantization — handle KV first.

FP8 weights · FP8 KV first
Workload
Math reasoning · agentic tool-use

Highest sensitivity. Even FP8 can show 0.5-1 point regression on MATH-500 and AIME-class benchmarks. Use FP16 baseline for highest-stakes reasoning paths; FP8 acceptable elsewhere; avoid 4-bit. Pair with full-precision reasoning eval if production-critical.

FP16 critical · FP8 baseline

06PitfallsPitfalls and eval gates worth running.

  • Skipping the workload eval. The single most common mistake. Aggregate benchmarks tell you the average regression; your workload may be on the long tail. Always build a 100-200 prompt eval set against production patterns before committing.
  • Quantizing weights without quantizing KV cache. Weight quantization helps memory and weight-load throughput; KV cache quantization helps long-context throughput. They compound. A deployment with FP8 weights and FP16 KV cache is leaving half the gain on the table.
  • Mixing AWQ and GPTQ across deployment paths without consistency. Each method has different per-model quirks. Pick one and stay on it for a deployment so eval results are comparable. Switching mid-stream wastes a week of regression triage.
  • Trusting calibration data that doesn't match workload. AWQ and GPTQ both use a calibration dataset for their post-training calibration step. If the calibration set is generic (Wikipedia) and your workload is domain-specific (legal, medical, code), the activation distribution will not match and quality will be worse than benchmarks suggest. Use workload-representative calibration data.
  • Ignoring per-task variance. Some tasks are robust to 4-bit; some collapse. Long-form generation typically robust; short-answer factual recall typically sensitive. Build a task-class breakdown into the eval set, not just an aggregate score.

07ConclusionQuantization is free money if you measure.

Quantization decisions, April 2026

FP8 is the new FP16. AWQ-4 is the right trade when VRAM binds. Below 4-bit is research.

By April 2026 the quantization landscape is settled at the top of the ladder: FP8 is the right default for any deployment on Hopper or newer hardware, INT8 the cross-platform fallback, AWQ-4 the right call when VRAM is the binding constraint. The cost regressions are well-characterized, the throughput gains are measurable, and the implementation is one config flag in vLLM, SGLang, or TensorRT-LLM.

The discipline is in the eval. Aggregate benchmark numbers tell you what to default to; workload-specific evaluation tells you whether the default holds. Build the eval first; flip the flag second; keep the eval running on every deployment to catch regressions before they hit production. That is the entire playbook.

The next generation of compression — sub-4-bit production-grade, mixed-precision per-layer, quantization-aware training built into the model from initialization — is in active research and will land in 2027-2028. The teams that have built the muscle to evaluate quantization rigorously will adopt those advances quarter-by-quarter; the teams that haven't will keep paying full FP16 rack rate for the next two years.

Quantization-aware inference

Move past one-flag-fits-all. Quantize by workload.

We design and operate quantization-tuned inference deployments for engineering teams shipping open-weight models at scale — covering level selection, calibration-data curation, eval-set construction, and per-workload regression gates.

Free consultationExpert guidanceTailored solutions
What we work on

Quantization engagements

  • Level selection — FP8 vs INT8 vs AWQ-4 vs GPTQ-4
  • Workload-specific eval set construction
  • Calibration-data curation for AWQ / GPTQ
  • Cross-stack deployment — vLLM, SGLang, TensorRT-LLM
  • Regression gates and quarterly re-evaluation
FAQ · LLM quantization in 2026

The questions we get every week.

FP8 is 8-bit floating-point with a 4-bit exponent and 3-bit mantissa (E4M3) or 5-bit exponent and 2-bit mantissa (E5M2). INT8 is 8-bit signed integer with a per-tensor or per-channel scale. The key difference is dynamic range: FP8 represents a wider range of values, which matters at the tail of the activation distribution. On H100/H200/B100/B200 with native FP8 hardware support, FP8 wins on both quality (0.3-0.5 point regression vs INT8's 0.5-0.9) and throughput (1.5× vs 1.4× lift). On A100 or AMD MI300X with older drivers, INT8 wins because there's no native FP8 path. Default to FP8 on Hopper-class hardware; use INT8 elsewhere.