AI inference cost is now a variable line in gross margin — and the teams that treat it as such are finding that caching, batching, model routing, and quantization can cut managed API spend by 50–90% on typical production workloads, without touching model quality.
The challenge is sequencing. Most teams reach for quantization or routing first because those feel like the "engineering" levers — but prompt caching on a high-reuse workload returns more savings faster, at near-zero implementation cost, with zero accuracy risk. Getting the order wrong is how teams spend engineering cycles on diminishing returns while leaving the largest savings untouched.
This playbook covers five levers in priority order, provides the break-even math for the two most commonly misunderstood ones (prompt caching and semantic caching), and closes with the unit-economics metric the FinOps Foundation recommends for aligning engineering decisions to business outcomes: cost-per-successful-output, not cost-per-token.
- 01Prompt caching is the highest-ROI first move.Anthropic cache reads on Sonnet 4.6 cost $0.30/MTok vs $3.00/MTok uncached — a 90% reduction on repeated context. The break-even for the 1-hour cache is fewer than 3 reuses per hour. Most production workloads exceed this trivially.
- 02Async batching delivers a guaranteed 50% discount.OpenAI's Batch API cuts synchronous API rates by 50% with a 24-hour completion window. For classification, enrichment, and embedding pipelines that are not latency-sensitive, this is a zero-quality-loss cost halving.
- 03Model routing stacks accuracy gains on top of savings.Intelligent routers like Not Diamond claim 30%+ cost savings and 5%+ accuracy gains over single-model deployments by predicting the best model per query. These are vendor-stated figures — treat them as directional, and benchmark against your own workload.
- 04FP8 quantization is effectively lossless.A 500,000+ evaluation study across the Llama-3.1 family found FP8 quantization produces effectively zero accuracy degradation. INT8 adds 1–3%. INT4 loss is larger and model/task-dependent. Quantization is the right lever only at self-hosted scale.
- 05Cost-per-successful-output is the metric that matters.Cost-per-token is an input metric. If you optimize for it at the expense of retry rates, hallucination rates, or task completion rates, you can cut your token bill while raising your real cost-per-outcome. Track both.
01 — FrameworkFive levers, one priority order.
Most inference cost content covers one or two levers in isolation. The gap is a prioritized, sourced comparison that tells you which to pull first given your deployment model, workload shape, and tolerance for implementation complexity.
The table below assembles all five levers against consistent dimensions — sourced from Anthropic docs, OpenAI pricing pages, and two arXiv studies (2411.02355 and 2411.05276). The managed-vs-self-hosted split is the key filtering decision for teams on API-first architectures: quantization and continuous batching only apply at self-hosted inference scale. If you are calling managed APIs, your lever set is levers 1–4.
| Lever | Typical cost reduction | Accuracy impact | Effort | Managed API | Self-hosted only |
|---|---|---|---|---|---|
| Prompt / KV caching | Up to 90% on cache-hit tokens | None (exact match) | Low | Yes | No |
| Semantic caching | 61–69% fewer API calls (research) | >97% accuracy preserved | Medium | Yes | No |
| Async batching | 50% off sync rates (OpenAI) | None | Low | Yes | No |
| Model cascade / routing | 30%+ (vendor-stated) | +5% accuracy (vendor-stated) | Medium | Yes | No |
| Quantization (FP8) | Memory footprint & throughput | Effectively lossless | High | No | Yes |
| Quantization (INT8/INT4) | Greater memory savings | 1–3% (INT8); variable (INT4) | High | No | Yes |
| Sources: Anthropic Prompt Caching Docs; OpenAI Batch API Docs; arXiv:2411.02355 (quantization); arXiv:2411.05276 (semantic cache); notdiamond.ai (routing — vendor-stated). Retrieve current pricing before production decisions. | |||||
The practical insight embedded in this table: for API-first teams, the decision tree starts with caching (zero accuracy risk, highest per-token savings rate), moves to batching (simple API change, guaranteed 50% discount), and only then evaluates routing (which requires benchmark work against your specific prompt distribution). Quantization belongs to a later, separate initiative if and when you move to self-hosted inference at scale.
02 — Lever 1Prompt and KV caching — 90% off repeated context.
Prompt caching is the highest-return first move available on managed APIs. On Anthropic's platform, cache reads on Claude Sonnet 4.6 cost $0.30/MTok against a standard input rate of $3.00/MTok — a 90% reduction on every token that hits the cache. On Haiku 4.5, the cache-hit rate drops to $0.10/MTok vs $1.00/MTok base. The write-side cost is the only trade-off: a 5-minute cache write costs 1.25× the base input rate, and a 1-hour cache write costs 2×.
The break-even calculation for the 1-hour cache on Sonnet 4.6 is straightforward. Writing a cache entry costs $6.00/MTok (2× base). Each subsequent hit saves $2.70/MTok ($3.00 minus $0.30). Break-even is reached at approximately 2.3 reuses of the same cached prefix within the 1-hour TTL window. Any workload where the same system prompt or tool definitions are sent more than twice per hour is in the money on the 1-hour cache.
Three engineering details matter for production deployments:
- Cache breakpoint hierarchy. Anthropic supports up to 4 explicit cache breakpoints per request, applied in order: tools → system → messages. Changing any block at or before a breakpoint invalidates that level and all subsequent levels. Structure your prompt so the most-stable content (tools definitions, background context) sits before the most-volatile content (user query, conversation history).
- Cache pre-warming. Sending a pre-warm request with
max_tokens: 0incurs only a cache write charge and generates zero output tokens. This eliminates the cold-start latency penalty on the first real request after a cache miss or expiry — particularly useful before peak traffic windows. - Workspace isolation. As of February 2026, Anthropic prompt caches are isolated at the workspace level, not the organization level. Teams using multiple workspaces must plan cache warming per workspace. AWS Bedrock and Google Vertex AI maintain organization-level cache isolation — a meaningful operational difference for multi-team deployments.
OpenAI's cached input pricing follows the same direction: GPT-5.5 cached input runs $0.50/MTok against a $5.00/MTok standard — also a 90% reduction. GPT-5.4 cached input is $0.25/MTok vs $2.50/MTok. OpenAI's caching is automatic; you do not set explicit breakpoints, but the same structural discipline applies — stable content first.
| Model | Base input | Cache hit | 5m write cost | 1h write cost | Break-even (5m) | Break-even (1h) |
|---|---|---|---|---|---|---|
| Haiku 4.5 | $1.00 | $0.10 | $1.25 | $2.00 | ~1.4 reuses | ~2.2 reuses |
| Sonnet 4.6 | $3.00 | $0.30 | $3.75 | $6.00 | ~1.4 reuses | ~2.3 reuses |
| Opus 4.7 | $5.00 | $0.50 | $6.25 | $10.00 | ~1.4 reuses | ~2.2 reuses |
| All prices per million tokens. Break-even = write cost ÷ (base input − cache hit). Source: Anthropic Prompt Caching Docs, retrieved May 2026. Verify current rates before production decisions. | ||||||
03 — Lever 2Semantic caching — reuse responses across similar queries.
Prompt caching is exact-match: the same token sequence hits the cache, a different sequence misses it. Semantic caching extends the savings to semantically equivalent queries — questions that mean the same thing even if the wording differs. The mechanism: embed incoming queries, compare against a cache of past query embeddings, and return the stored response if the cosine similarity exceeds a tuned threshold.
A 2024 arXiv study (arXiv:2411.05276) applied this approach using Redis-backed embedding matching and reported API call reductions of 61.6–68.8% across query categories, with cache hit rates in that same range and positive-hit accuracy above 97%. The research setting used FAQ-style and support-style query distributions — the classes most likely to produce high hit rates. Production hit rates depend heavily on query diversity and similarity threshold tuning; treat the 68.8% figure as a ceiling for well-suited workloads, not a baseline expectation for general-purpose chat.
The embedding cost adds a new line to the calculation. For most workloads, embedding costs are an order of magnitude below inference costs — see our embedding model cost comparison for the current rate landscape — which means the semantic cache ROI is positive for any workload with meaningful query repetition. The implementation decision is the similarity threshold: too tight and hit rates collapse; too loose and accuracy degrades. Start at 0.95 cosine similarity and adjust based on sampled false-positive and false-negative rates.
04 — Lever 3Async batching — 50% off for non-latency-sensitive workloads.
The OpenAI Batch API delivers a flat 50% discount on standard synchronous API rates, a separate higher rate-limit pool, and a clear 24-hour turnaround window. For workloads that do not require real-time responses — enrichment pipelines, classification jobs, embedding generation, content moderation, report generation — the implementation cost is minimal and the savings are unconditional.
The API accepts up to 50,000 requests per batch file, with a maximum input file size of 200 MB. Batch creation rate limit is 2,000 batches per hour. Supported endpoints include /v1/responses, /v1/chat/completions, /v1/embeddings, /v1/completions, /v1/moderations, and /v1/images/generations. Output files are automatically deleted 30 days after completion.
One important interaction: OpenAI's data residency add-on adds a +10% price premium. Teams on data-residency plans can batch-process to partially net out the geo surcharge — the Batch API's -50% discount and the +10% data residency surcharge stack independently, resulting in a net ~45% discount on residency-enabled workloads.
OpenAI Batch API
Best for enrichment, classification, embedding, moderation, and report generation pipelines. Zero quality change — same model, same weights, scheduled execution. Output files auto-deleted after 30 days.
Continuous Batching (vLLM)
For self-hosted inference, continuous batching (vLLM) eliminates the idle-GPU problem of static batching. The 23× figure requires PagedAttention KV-cache optimization on top of continuous batching. Continuous batching alone delivers ~8× throughput improvement on TGI and Ray Serve baselines.
The Anyscale team summarized the self-hosted insight precisely: "LLM inference is memory-IO bound, not compute bound. In other words, it currently takes more time to load 1MB of data to the GPU's compute cores than it does for those compute cores to perform LLM computations on 1MB of data." Filling GPU memory with concurrent sequences via continuous batching is the principal throughput lever for on-prem deployments. A 13B-parameter model consumes roughly 1 MB of GPU state per token in a sequence — on an A100 40 GB GPU (after loading model weights), practical batch sizes are limited to approximately 28 sequences at 512-token length or 7 sequences at 2,048-token length without KV-cache compression.
05 — Lever 4Model cascade routing — match every query to the cheapest capable model.
Model routing addresses a structural inefficiency in single-model deployments: you pick one model for your workload, and that model is overkill for a large fraction of your queries. An intelligent router predicts, per query, which model is most likely to produce the right output at the right quality level — and routes accordingly, using cheaper models for simpler requests and escalating to frontier models only when the query complexity warrants it.
Not Diamond, whose customers include Dropbox, IBM, DoorDash, and American Express, claims 30%+ cost savings and 5%+ accuracy gains over single-model deployments for agent workloads. These are vendor-stated figures based on production workloads — not a scored public benchmark. Directionally, they are consistent with what you would expect: routing routes simpler queries to lower-cost models, which cuts the bill, and routes harder queries to better models, which can improve accuracy vs the single-model baseline.
The routing decision for your own stack depends on two inputs: the cost differential between your candidate models (see the Q2 2026 provider pricing matrix for current rates) and the complexity distribution of your query traffic. If 70% of your queries are simple enough for a $0.50/MTok model and 30% require $5.00/MTok capability, a perfect router saves you roughly 65% vs routing everything to the expensive model. Real routers are imperfect — benchmark on your own traffic before committing.
"Not Diamond significantly reduced our inference costs while also driving improvements in output quality."— Grant Miller, CEO and Co-founder of Replicated, notdiamond.ai
High-repetition query workloads
Semantic caching first (up to 68% API call reduction on FAQ distributions), then model routing for the cache misses. Do not build a routing system for workloads where caching can eliminate most calls.
Async classification & enrichment
OpenAI Batch API is the correct first move: 50% discount, zero code complexity, zero quality change. Routing adds incremental savings on top after you have captured the guaranteed 50%.
Multi-step agent loops
Routing is highest-value here: agent loops generate many sub-queries with highly variable complexity. Combine with token-budget controls — see our agent token budget framework for the cost-control layer.
On-prem GPU deployment
FP8 quantization is effectively lossless (arXiv:2411.02355, 500,000+ evaluations) and should be the default on self-hosted deployments. Stack continuous batching on top for throughput. INT8 is acceptable with 1–3% accuracy trade-off; INT4 requires per-task evaluation.
06 — Lever 5Quantization — the self-hosted throughput lever.
Quantization reduces the numerical precision of model weights and activations, shrinking the memory footprint and increasing throughput on self-hosted hardware. Since LLM inference is memory-IO bound, the throughput gains from fitting more of the model (and more concurrent sequences) into GPU memory are substantial.
A comprehensive Red Hat AI / IST Austria study (arXiv:2411.02355) ran 500,000+ evaluations across the Llama-3.1 family and reached three clear conclusions: FP8 quantization (W8A8-FP) is effectively lossless across all evaluated tasks and model scales; well-tuned INT8 (W8A8-INT) achieves only 1–3% accuracy degradation; INT4 degradation is larger and model- and task-dependent. The paper does not give a universal INT4 percentage — do not use a single number for INT4 without benchmarking on your specific model and task combination.
Effectively lossless
Red Hat AI / IST Austria study across the full Llama-3.1 family (8B → 405B). FP8 (W8A8-FP) is the recommended default for self-hosted inference — no quality trade-off.
Acceptable for most tasks
Well-tuned INT8 (W8A8-INT) achieves 1–3% accuracy degradation on the Llama-3.1 family. Acceptable for most production use cases; benchmark on your specific tasks before shipping.
vLLM throughput vs naïve
Up to 23× throughput improvement over naïve static batching on self-hosted inference with vLLM. Requires PagedAttention KV-cache optimization; continuous batching alone delivers ~8× improvement.
The original interpretation here: FP8 has crossed the threshold from "experimental optimization" to "production default." When a 500,000-evaluation study across an entire model family finds effectively zero accuracy degradation, the question shifts from "should we use FP8?" to "why are we still running BF16?" on self-hosted deployments. INT8 is the correct choice when hardware does not support FP8 natively. INT4 remains a case-by-case engineering call that requires workload-specific benchmarking.
Looking forward, the FP4 quantization-aware training pioneered in models like DeepSeek V4 suggests the next quantization frontier for self-hosted inference. Purpose-built hardware could make FP4 roughly 1.33× more efficient than FP8 — but this is a hardware-dependent projection, not a currently deployable production option for most teams.
07 — Unit EconomicsThe metric that matters: cost-per-successful-output.
The FinOps Foundation identifies cost-per-token as the foundational new metric for AI cost management — but "foundational" does not mean "sufficient." Cost-per-token is a raw input metric. If you optimize for it at the expense of retry rates, hallucination rates, or task completion rates, you can cut your token bill while raising your real cost-per-business-outcome.
The unit-economics metric that aligns engineering decisions to business value is cost-per-successful-output: total inference cost divided by the number of outputs that pass your quality gate. A model that costs $0.50/MTok but fails 40% of tasks has an effective output cost roughly 1.67× that of a model costing $0.70/MTok with a 5% failure rate. The cheaper model costs more to operate.
Instrumenting cost-per-successful-output requires connecting your inference cost data to your evals and traces layer. The observability tooling for this — structured cost tracking per request, per pipeline stage, per feature — is covered in our agent observability guide. The cost tracking layer and the quality measurement layer need to be set up simultaneously; instrumenting cost alone, without a quality gate, produces the cost-per-token trap described above.
Inference cost as % of baseline · illustrative optimization stack
Illustrative composition. Individual savings depend on workload shape. Source levers: Anthropic docs; OpenAI Batch API docs; notdiamond.ai (vendor-stated).08 — FinOps ProgramStanding up a FinOps program for AI inference.
The FinOps Foundation frames the organizational shift required for AI cost management around three principles adapted from traditional cloud FinOps: Price × Quantity = Cost still applies; AI service costs appear in cloud billing data and are therefore trackable; and tagging/labeling is possible but requires adjustments for shared environments, training costs, and API-based resources.
Two structural challenges distinguish AI FinOps from cloud FinOps: GPU scarcity (which makes provisioning decisions different from compute elasticity) and volatile pricing (inference rates have moved dramatically year-over-year and will likely continue to). The FinOps Foundation recommendation is to regularly track and review AI costs and usage, set quotas, tag resources, and optimize GPU allocation. Our addition: tag at the feature level, not just the team level, so you can attribute cost-per-successful-output per customer-facing capability.
For teams managing agent token budgets, the FinOps layer adds the business-outcome dimension on top of the engineering constraint layer. Token budgets prevent runaway spend on individual requests; FinOps attribution tells you which product features are generating positive unit economics vs burning margin. Both are necessary; neither is sufficient alone.
For businesses evaluating the wider strategic shift to AI-native operations — including how to structure cost accountability as AI becomes a COGS driver — our AI transformation engagements cover the governance, vendor selection, and unit-economics instrumentation decisions that a FinOps program requires.
Cut spend first where it costs nothing — then layer the harder levers.
The four-lever framework in this playbook is not a menu of equally weighted options — it is a priority-ordered decision tree. Prompt caching on high-reuse workloads returns 90% savings on cache-hit tokens at near-zero implementation cost and zero accuracy risk. Async batching delivers a guaranteed 50% discount on non-latency-sensitive pipelines with minimal code change. Routing stacks additional savings on top once you have captured the easy wins. Quantization belongs to a later initiative if and when you move to self-hosted scale.
The teams that will get this wrong are those who reach for the architecturally interesting levers first — routing infrastructure, quantization pipelines — before enabling the trivial ones. A two-line change to add cache breakpoints to your system prompt returns more savings in the first week than months of routing infrastructure work for most workloads.
The broader framing is the unit-economics shift the FinOps Foundation identifies: cost-per-token is not the metric that aligns engineering to business outcomes. Cost-per-successful-output is. Until your cost instrumentation connects inference spend to quality signals, you are optimizing an input metric that can improve while your real cost-per-outcome worsens. Build both layers together, or one will undermine the other.