LLM cost vocabulary diverges across vendors, even though the underlying mechanics are similar. OpenAI calls it "cached input"; Anthropic calls it "prompt caching"; Google calls it "implicit caching." The math is approximately the same; the contracts and forecasts depend on naming precisely what's being charged.

This reference holds 60+ terms across five families: token mechanics, cache vocabulary, pricing tiers, discount structures, and operational metrics. Each entry has a definition, vendor-specific notes, and a worked example for finance teams.

Use it as the translation table when you build LLM cost forecasts, negotiate enterprise contracts, or design cost-aware system architectures.

Key takeaways

01
Rack-rate forecasting overstates true LLM spend by 3-10× at production scale.Cache hits, batch discounts, and reserved capacity all reduce blended rate substantially. Forecast against blended rate; use rack rate only for cold-start cost ceilings.
02
The four cost-driving terms: prefill, decode, cache write, cache read. Know what each does to your bill.Prefill is cacheable; decode is not. Cache write is more expensive than uncached read; cache read is dramatically cheaper. Most cost optimization centers on these four.
03
Reasoning tokens are billed at output rates but invisible. Budget separately.Claude extended thinking and OpenAI reasoning effort consume tokens that don't appear in the response but are charged at output rates. Track separately or you'll under-forecast.
04
Cost per task beats cost per token for executive reporting.$/token forecasts hide token efficiency, success rate, and retry cost. Cost per successful task captures all three in one number — the metric to defend in budget reviews.
05
Vendor naming differs but mechanics converge. Build your forecasting model in vendor-neutral terms.OpenAI 'cached input' = Anthropic 'prompt cache read' = Google 'implicit cache hit'. Maintain a vendor-neutral cost model and translate at the contract boundary.

01 — Family 01Token mechanics.

The basic billing units. Different from words; tokenization varies by model family.

Token. The basic billing and computation unit. Roughly 0.75 English words per token; varies by tokenizer.

Input token. A token in the context fed to the model — system prompt, user message, tool results, retrieved context. Generally cheaper than output tokens.

Output token. A token generated by the model. Charged at a higher rate than input tokens (typically 3-10× input rate).

Reasoning token. Internal-reasoning tokens consumed by Claude extended thinking or OpenAI reasoning effort. Billed at output rates but invisible in the response. Track as a separate budget line.

Tokenizer. The model component that converts text to tokens. BPE (Byte-Pair Encoding) variants dominate; Tiktoken (OpenAI), GPT-4 tokenizer, and sentencepiece are common implementations.

Token-to-word ratio. Approximate characters/words per token in a given tokenizer. English ~4 chars/token; languages with longer words or non-Latin scripts have lower ratios.

Prefill. The model's first pass over the input, computing attention states. Cost-dominated by context length; latency-dominated by hardware throughput. Heavily cacheable.

Decode. Token-by-token output generation after prefill. Dominated by per-token throughput; non-cacheable. Decode latency scales with output length.

Time-to-first-token (TTFT). Latency between request and first output token. Dominated by prefill cost.

Tokens-per-second (TPS). Decode-phase throughput. Determines steady-state latency.

Context window. The maximum input token count the model can attend to in a single call. 1M is the current frontier headline (Claude Opus 4.7, GPT-5.5, DeepSeek V4); Gemini ships 2M.

Output token limit. The maximum tokens the model can generate in one response. Configured per request via max_tokens or equivalent. Differentiated from context window.

Effective context. The portion of the window the model can attend to with stable quality. Often shorter than nominal — long-context degradation is documented across all frontier models.

Phase 1

Prefill

input tokens · attention pre-compute

Cost = context length × input rate. Heavily cacheable. Drives TTFT.

Cacheable

Phase 2

Decode

output tokens · sequential generation

Cost = output length × output rate. Not cacheable. Drives steady-state latency.

Non-cacheable

02 — Family 02Cache vocabulary.

How LLM caching actually works at the cost layer. The most-leveraged vocabulary in any production cost optimization.

Prefix cache. Pre-computed attention states for a recurring input prefix. When the next request shares the prefix bytes, the model reads cached states instead of recomputing from scratch.

KV cache. Per-token key-value attention state held in memory during decode. The structural reason decode performance is what it is.

Cache write. The first request that establishes a cached prefix. Charged at a write premium (typically 1.25× input rate).

Cache read. Subsequent requests that hit the cached prefix. Charged at a sharply discounted rate (typically 10% of input rate).

Cache hit. A request that successfully reads from cache. Charged at cache-read rate.

Cache miss. A request that doesn't share cached prefix bytes. Recomputes and writes new state at cache-write rate.

Cache hit rate. Fraction of requests served by cache reads. The leverage metric on long-context economics.

Cache TTL. Time-to-live for cached state. Anthropic offers 5-minute (default), 1-hour, and 24-hour tiers with progressively higher write premiums. OpenAI handles caching transparently with implicit TTL.

Cache marker. An explicit checkpoint in the prompt declaring where caching should occur. Anthropic uses up to four cache markers per request; OpenAI handles implicitly.

Implicit caching. Caching managed by the provider without explicit user markers. OpenAI and Google use implicit caching as the default model.

Explicit caching. User-specified cache markers in the prompt. Anthropic's primary model. Gives developers control; requires understanding of cache mechanics.

Cache invalidation. The point at which cached state becomes stale and must be recomputed. Caused by prefix-byte changes (any modification to the cached section invalidates) or TTL expiration.

Cache prime. The first request that establishes a cache. Pays the write premium; subsequent requests amortize against the prime cost.

Break-even hit count. Number of cache hits required to amortize the cache-write premium. Drives cache-tier (5-min vs 1-hour vs 24-hour) selection.

"Cache hit rate is the single highest-leverage metric in production LLM cost. Track it; alert on regressions."— Internal cost-optimization retro, May 2026

03 — Family 03Pricing tiers.

How vendors price across access patterns. Five canonical tiers govern most enterprise LLM spend.

Rack rate. The published per-token price without volume or commitment discount. Used for ad-hoc API access; never the right number for production forecasting at scale.

Standard tier. Real-time inference at published rates. Default access pattern; same as rack rate unless modified by volume discount.

Batch tier. Asynchronous processing with 24-hour SLA. Typically 50% discount off standard. OpenAI Batch API, Anthropic Batch API, AWS Bedrock Batch all offer this tier.

Provisioned throughput (PTU). Dedicated capacity at a fixed hourly rate. Guaranteed latency; expensive at low utilization. Microsoft Azure's branded term; concept exists across AWS Bedrock, GCP Vertex.

Reserved capacity. Pre-purchased throughput at committed-spend discount. Enterprise-only on most vendors. Trades flexibility for predictability.

Volume tier. Discount that kicks in at spending thresholds. Typically negotiated; varies widely by vendor and spend level.

Free tier. Capped free usage; useful for evaluation but not production. Limits and rate caps apply.

Tier 1 / Tier 2 / Tier 3. Vendor-specific access tiers with different rate limits. OpenAI uses these as labels for usage-based access escalation.

Per-request rate limit. Cap on requests per minute or per second. Tier-dependent; affects both cost (because of retry-on-throttle) and architectural choices.

Per-token rate limit. Cap on tokens consumed per minute. Often the binding constraint at scale.

Default

Standard

1tier

Real-time at rack rate. Default for development; baseline for production forecasting.

Standard

Cost lever

Batch

1tier

50% discount; 24-hour SLA. Use for evaluation runs, content generation, background scoring.

Batch

Predictability

Provisioned · reserved

2tiers

Dedicated capacity (provisioned) and committed spend (reserved). Trade flexibility for predictability.

Enterprise

04 — Family 04Discount structures.

How discounts compose. Multiple discounts can stack; naming each precisely keeps forecasting honest.

Volume discount. Discount applied at spending thresholds. Typically negotiated for enterprise agreements.

Commitment discount. Discount tied to committed annual or quarterly spend. Reserved capacity falls under this.

Cache discount. The price reduction on cached input tokens. Anthropic offers 90% off (cached reads at 10% of input rate); OpenAI offers similar via implicit caching.

Batch discount. The 50% discount typically applied to async batch-tier requests.

Context-window discount. Some vendors (notably Google) offer reduced rates for tokens beyond a certain context length. Inverse to the standard pricing shape.

Multi-modal pricing. Image, audio, or video inputs are priced separately, often as a token equivalent. Image inputs typically cost 1-3× equivalent text input.

Function-call premium. Some vendors charge a premium for function-call (tool-use) tokens. Mostly deprecated as of 2026; legacy pricing on some APIs.

Fine-tuned model premium. Higher per-token rate for fine-tuned vs base models. Typically 3-6× depending on vendor.

Off-peak discount. Some vendors offer time-based discounts for non-peak inference. Less common than batch.

Stacking. Multiple discounts applied together — e.g., batch + cache + commitment. Most discounts stack; check vendor terms for exclusions.

05 — Family 05Operational metrics.

How LLM cost is actually measured day-to-day. The vocabulary for finance and engineering shared dashboards.

Cost per task. Total spend per successfully completed end-to-end task. The single metric worth defending in executive reviews.

Cost per attempt. Total spend divided by total runs (including failures). Diagnostic complement to cost-per-task.

Cost per 1K requests. Total spend divided by request count, scaled to thousands. Easier to compare across systems.

Cost per query. For information-retrieval workloads: total spend divided by user queries. The user- facing economic unit.

Blended rate. The effective $/token after cache hits, batch discounts, commitment, and other discounts factor in. Always lower than rack rate; the number to use for scale forecasting.

Gross margin per query. Revenue per query minus cost per query. The unit economics of an LLM-powered product.

Token efficiency. Output tokens per successful task. Lower is better; captures whether the model is verbose or concise relative to value delivered.

Spend per active user. Total spend divided by active user count. Scales naturally with usage; trends shift with engagement.

Run-rate spend. Annualized projection of current monthly spend. Used for budget planning; should be paired with growth assumptions.

Cost regression. An unexpected increase in cost per task between releases. Signals prompt inefficiency, model swap, or cache hit-rate drop. Production cost monitoring should alert on this.

Cache write ratio. Ratio of cache writes to cache reads. High ratio (above 0.4) signals topology problems — your cache markers aren't aligned with the actual reuse pattern.

Output amplification ratio. Output tokens divided by input tokens. Captures whether long-context inputs elicit disproportionate outputs. The classic silent-cost driver.

The two metrics worth defending

Cost per task is the only metric that survives across model releases, pricing shifts, and architectural changes. Blended rate is how you forecast spend after discounts kick in. Track both; ignore $/1K-tokens for executive reporting.

06 — Family 06Vendor-specific notes.

Where vendors diverge on naming and mechanics. Useful for multi-vendor cost models.

OpenAI. Uses "cached input" (implicit caching). Rate limits structured by tier (Tier 1-5). Batch API with 50% discount and 24-hour SLA. Reasoning effort adds output-rate-billed tokens.

Anthropic. Uses "prompt caching" with explicit cache markers. Three TTL tiers (5-min, 1-hour, 24-hour). Cache write at 1.25× input rate; cache read at 10% of input rate. Extended thinking adds output-rate- billed tokens.

Google (Vertex AI / Gemini API). Uses "implicit caching" and "context-window discount." Reduced rates beyond certain context lengths. Different pricing model from OpenAI/Anthropic.

AWS Bedrock. Hosts multiple model families with vendor-specific pricing rolled into Bedrock surface. Provisioned throughput available; cross-region inference capable.

Azure OpenAI. Microsoft's hosted OpenAI. Pricing closely mirrors OpenAI direct; provisioned throughput unit (PTU) model dominant for enterprise.

Together AI. Hosts open-weight models at competitive rates. Per-token pricing for Llama, Mistral, DeepSeek, Qwen, and others.

Fireworks. Open-weight inference platform. Per-token + dedicated endpoints.

Replicate. Function-as-a-service style pricing. Per-second billing for some workloads; per-token for others.

Self-hosting (vLLM, SGLang, TGI). No per-token charges; cost is GPU compute. Different vocabulary entirely (utilization, KV cache size, batch size, quantization). Right at sustained high-volume.

"Multi-vendor cost models pay off the moment any one vendor changes pricing. Build it once; translate at the contract boundary."— Internal multi-vendor cost retro, March 2026

07 — ConclusionToken economics rewards vocabulary precision.

The shape of LLM cost vocabulary · April 2026

Forecast against blended rate; report cost per task; track cache hit rate.

Token economics is where engineering and finance meet. Vocabulary precision pays off twice: in forecasting accuracy (rack-rate forecasts overstate true spend by 3-10× at scale) and in vendor negotiation (knowing exactly what's discounted gets you better terms).

The 60+ terms in this glossary cover the working vocabulary of LLM cost in Q2 2026. Three metrics drive most reporting: cost per task (executive metric), blended rate (forecasting input), cache hit rate (optimization lever). Build dashboards on these three; layer diagnostics underneath.

Update vocabulary when major vendors change pricing structure (OpenAI batch tier expansion 2025; Anthropic cache TTL tier additions; Google context-window discount evolution). Maintain a vendor-neutral cost model and translate at the contract boundary.

Token Economics Vocabulary: LLM cost glossary.

01 — Family 01Token mechanics.

Prefill

Decode

02 — Family 02Cache vocabulary.

03 — Family 03Pricing tiers.

Standard

Batch

Provisioned · reserved

04 — Family 04Discount structures.

05 — Family 05Operational metrics.

06 — Family 06Vendor-specific notes.

07 — ConclusionToken economics rewards vocabulary precision.

Forecast against blended rate; report cost per task; track cache hit rate.

Stop forecasting at rack rates.

Token-economics engagements

The cost questions we get every week.

Continue exploring LLM economics references.

AI Marketing Acronym Master List 2026: 250+ Terms

AI Compliance & Governance Glossary 2026: 100 Terms

Harness Engineering: Writer's 40% Token-Spend Cut, Decoded

The AI Cost Reckoning: Right-Sizing Model Spend 2026