Embedding model cost is the most miscounted line item in a production RAG budget. The per-1M-token headline price hides the two numbers that actually decide the bill — refresh cadence and embedding dimensionality — and the wrong vendor choice compounds month after month for as long as the corpus keeps changing. This calculator reframes the decision around the math that matters.
What's at stake: a 1M-chunk corpus at 500 tokens per chunk is 500M tokens of initial ingest, and a 5% monthly refresh rate adds another 25M tokens every month indefinitely. Pick the vendor that looked $20 cheaper on a one-off ingest and you can spend the difference back inside a quarter once the refresh cycle starts running. The decision is structural, not price-list.
This guide compares five frontier embedding providers — OpenAI, Cohere, Voyage, Mistral, and Jina — across per-token pricing, dimensionality tradeoffs, MTEB and BEIR quality scores, refresh-cycle math over a 12-month horizon, and break-even corpus sizes where each vendor wins. Five archetypes at the end. Numbers sourced from each vendor's public pricing pages and benchmark reports as of May 4, 2026.
- 01 · OpenAI text-embedding-3-large is the strong default. Quality plus price stability plus matryoshka truncation make it the lowest-regret choice for most teams shipping their first or second production RAG pipeline. Start here unless you have a measured reason to leave.
- 02 · Voyage-3 is the quality leader on technical corpora. Domain-specific variants (code, finance, law) reportedly outperform generalist competitors on retrieval recall by meaningful margins. Worth the premium when your corpus is jargon-heavy and accuracy compounds downstream.
- 03 · Cohere Embed v4 is the multilingual winner. Coverage across 100+ languages, with retrieval quality holding up far better than translate-then-embed alternatives. Use when the corpus or the query distribution crosses language boundaries.
- 04 · Mistral Embed is the cost leader at high volume. Lowest per-1M-token list price among the frontier providers, with quality that holds up on generalist English-language corpora. Use when token budget rather than headline accuracy caps the design.
- 05 · Refresh-cycle math reframes the decision. Year-1 cost is dominated by initial ingest; year-3 cost is dominated by refresh. A vendor that looks expensive at ingest can win on three-year TCO once delta-token volume compounds. Compare on the full horizon.
01 — Five Vendors
OpenAI, Cohere, Voyage, Mistral, Jina — same task, different curves.
Five providers ship frontier-tier general-purpose embedding models into the market today. They share the same job to be done — turn a chunk of text into a dense vector that supports semantic similarity search — but they price, dimension, and benchmark differently enough that the right pick depends on the corpus, not the headline.
Below is the snapshot. Each card lists the flagship model, its native dimension, its standout positioning, and the workload class where the vendor most often earns a place on a shortlist. Read this as the map before reading the math.
text-embedding-3-large (OpenAI) · Default · Matryoshka
Matryoshka-trained — truncate to 256, 512, 1024, or 1536 dimensions with graceful quality decay. Strong on MTEB English. The lowest-regret default for most production RAG pipelines today.

Embed v4 (Cohere) · Multilingual
100+ language coverage with retrieval quality that holds up materially better than translate-then-embed approaches. Compressed embedding variants for cost-sensitive deployments. The multilingual specialist.

voyage-3 + domain variants (Voyage) · Quality leader
Domain-specific models for code, finance, and law that reportedly outperform generalist competitors on retrieval recall in their respective sectors. Premium pricing, premium accuracy on jargon-heavy corpora.

mistral-embed (Mistral) · Cost leader
Lowest per-1M-token list price among the frontier providers as of May 2026, with quality that holds up on generalist English-language corpora. The volume play when the corpus is large and the budget is tight.

Jina Embeddings v4 (Jina) · Open weights
Open-source weights plus a hosted API. The self-host story is the strongest of the five — useful for sovereignty-bound deployments and teams that want to retire the API dependency entirely over time.

The five-vendor map is stable enough to plan against for the rest of 2026, but pricing pages change quarterly and benchmark leaderboards shift as new models ship. Treat the cards above as the directional positioning; treat the numbers in §02–§04 as valid-as-of-May-2026 and re-verify before committing to a multi-year roadmap.
02 — Per-Token Pricing
Realtime vs batch pricing across providers.
Per-1M-token list prices are the easy number to grab off a pricing page; what they hide is the realtime / batch split. Most embedding API providers now offer a discounted batch tier for asynchronous ingest workloads — useful precisely because initial corpus ingest and large monthly refreshes can run as batch jobs. Live query embedding still runs at realtime rates.
The matrix below is the operating choice for each vendor — when to pick realtime, when to pick batch, what the practical pricing tier delivers, and the workload class it fits. Numbers are illustrative-tier comparisons; verify exact rates on each provider's current pricing page before contracting.
OpenAI realtime — text-embedding-3-large at list rate · Pick for query-side
The right tier for query-side embedding, where TTFB matters more than per-token savings. Pair with batch ingest below to split the workload sensibly. Reliable, stable pricing, the lowest-regret default for live retrieval.

OpenAI batch — async ingest tier, 50% off list · Pick for ingest
Use for the initial corpus ingest and any monthly refresh runs that don't need sub-second turnaround. Submit as a batch file, retrieve results within the published SLA window. Effectively halves the dominant cost line for most production RAG budgets.

Cohere — realtime with compressed variants · Pick for multilingual
Compressed embedding variants drop the per-vector storage and bandwidth cost while keeping retrieval quality close to the full-dimension model. Useful when storage is a more binding constraint than per-call latency. Multilingual workloads win here.

Voyage — premium realtime, no public batch tier · Pick for domain RAG
Higher per-token price than the OpenAI / Mistral pair, justified by domain-specific quality on technical corpora. No published batch discount as of May 2026 — model the full ingest cost at list rate when comparing TCO. Premium positioning, premium math.

Mistral — volume tier, lowest per-token list price · Pick for volume
The cost leader at the top of the table when the corpus is large and the budget is tight. Generalist English-language quality holds up; coverage outside generalist English is narrower than Cohere or OpenAI. Verify on your corpus before committing.

Jina — hosted API + self-host fallback · Pick for sovereignty
The hosted API competes on price with Cohere and OpenAI; the open-weight self-host option is the structural advantage. Useful when sovereignty or long-horizon cost argues for the option to retire the API dependency entirely over time.

The most common deployment we see in 2026 splits the workload: a batch tier for initial ingest and monthly refresh, a realtime tier from the same vendor for query embedding. Splitting ingest and query across vendors doesn't work at all — cross-vendor vectors are not comparable, so query embeddings must come from the same model that embedded the corpus, and any ingest-vendor change means re-embedding the entire corpus.

One quiet gotcha worth flagging: free-tier or low-volume promotional pricing rarely survives the move to production traffic. Always model the cost calculation against the volume-tier rate you'll actually pay, not the introductory rate the pricing page leads with. The numbers in §04 use production-tier rates throughout.
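The split-tier monthly bill is easy to model. A minimal sketch, assuming a flat 50% batch discount (the actual discount and tier structure vary by vendor — verify on the pricing page); rates are dollars per 1M tokens and volumes are in millions of tokens:

```python
def monthly_cost(refresh_tokens_m: float, query_tokens_m: float,
                 realtime_rate: float, batch_discount: float = 0.5) -> float:
    """Monthly spend when refresh runs on the batch tier and live query
    embedding runs at the realtime rate. Token volumes in millions."""
    batch_rate = realtime_rate * (1 - batch_discount)
    return refresh_tokens_m * batch_rate + query_tokens_m * realtime_rate

# 25M refresh tokens/month batched plus 5M query tokens/month realtime,
# at an illustrative $0.13/1M realtime rate (not a quoted vendor price).
bill = monthly_cost(25, 5, 0.13)   # 25 × 0.065 + 5 × 0.13
```

Even though the query volume here is a fifth of the refresh volume, it contributes over a quarter of the bill — which is why the realtime rate still matters after the batch split.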
"Per-token list price is the easy number. Refresh-cycle math is the real bill."
— Digital Applied engineering team, post-mortem on a 14M-chunk migration
03 — Dimensionality
Quality vs storage cost — the tradeoff.
Embedding dimension is the second-most-miscounted variable in a production RAG budget. The default for many providers ships at 1024 or 1536 dimensions; OpenAI's flagship goes to 3072. The cost of storing and searching those vectors scales linearly with dimension count, and the cost of returning a vector to a client (network bandwidth, JSON payload size) scales linearly too.
What changed in 2024–2025 is the broad adoption of matryoshka representation learning — models trained so that the first k dimensions of the embedding remain useful on their own. A matryoshka-trained 3072-dim embedding can be truncated to 1536, 1024, 512, or 256 dimensions with quality decay that is much gentler than naive truncation of a non-matryoshka model. OpenAI's text-embedding-3 family is the canonical example.
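The truncate-and-renormalize step itself is a couple of lines; a minimal NumPy sketch, assuming the input vector is already unit-normalized:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of a matryoshka-trained embedding
    and re-normalize to unit length so cosine similarity stays meaningful."""
    truncated = embedding[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: truncate a synthetic 3072-dim unit vector down to 1024 dims.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)
small = truncate_matryoshka(full, 1024)   # shape (1024,), unit length
```

OpenAI's embeddings API also exposes a `dimensions` request parameter that performs the equivalent truncation server-side for the text-embedding-3 family, which saves the bandwidth of shipping the full vector.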
3072 dimensions · Quality ceiling
OpenAI text-embedding-3-large default. Maximum retrieval quality the model can deliver. Storage cost in pgvector or any dense store is roughly 12 KB per vector at fp32, 6 KB at fp16. Use when accuracy on hard retrieval queries is the binding constraint.

1024–1536 dimensions · Production default
Matryoshka-truncated default for most production deployments. Quality decay reportedly under 1% on MTEB English versus full 3072. Storage roughly halves. Index build time and query latency improve in line with the dimension drop. The sweet spot for ~80% of production RAG today.

256–512 dimensions · Edge / cost-bound
Aggressive truncation for cost-bound or edge deployments. Quality decay becomes visible on hard retrieval queries — name disambiguation, jargon, near-duplicate detection. Storage drops to roughly 1–2 KB per vector. Worth measuring on your own queries before deploying; not a universal win.

The decision framework is straightforward in concept and harder in practice: measure recall on a held-out retrieval set at full dimension, then re-measure at 1536, 1024, 512, and 256. Pick the smallest dimension where recall@10 drops by less than a threshold you set — most teams settle on 1–2 percentage points of MTEB-equivalent loss as the acceptable margin.
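That sweep is a few lines of NumPy once the held-out set is embedded at full dimension. A sketch, assuming unit-normalized embeddings and one gold document per query:

```python
import numpy as np

def truncate(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` dims, re-normalize."""
    t = vecs[:, :dim]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

def recall_at_k(queries: np.ndarray, docs: np.ndarray,
                gold: list[int], k: int = 10) -> float:
    """Fraction of queries whose gold document lands in the top-k by cosine."""
    topk = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

def dimension_sweep(queries, docs, gold, dims=(3072, 1536, 1024, 512, 256)):
    """Recall@10 at each candidate dimension: pick the smallest dim whose
    drop from full dimension stays inside your chosen threshold."""
    return {d: recall_at_k(truncate(queries, d), truncate(docs, d), gold)
            for d in dims}
```

The brute-force similarity matrix is fine at eval-set scale (50–100 queries against a few thousand candidate chunks); there is no need to stand up an ANN index just to run the sweep.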
For non-matryoshka models (some Cohere and Voyage variants), the tradeoff is sharper — truncating a non-matryoshka vector tends to degrade quality much faster, so the practical move is to pick the dimension at training time and accept that as the operating point. Check each vendor's docs for whether matryoshka-style truncation is supported before betting the architecture on it.
04 — Refresh Math
Initial ingest + 12 months of refresh deltas.
The refresh-cycle formula is the calculator that this whole article exists to put in front of the reader:
12_month_cost
  = initial_ingest_cost
  + (delta_chunks × tokens_per_chunk × cost_per_million_tokens / 1,000,000 × 12)

# where:
# initial_ingest_cost    = total_chunks × tokens_per_chunk × cost_per_million_tokens / 1,000,000
# delta_chunks           = total_chunks × refresh_rate_per_month
# refresh_rate_per_month = fraction of corpus that changes monthly (typically 2–10%)
# tokens_per_chunk       = average chunk length (typically 300–800)

Worked example. A 1M-chunk corpus at 500 tokens per chunk is 500M tokens of initial ingest. At a 5% monthly refresh rate, the delta is 50K chunks per month, or 25M tokens of recurring ingest. Over 12 months that is 300M tokens of refresh on top of the 500M-token initial ingest — refresh becomes roughly 38% of year-1 token volume, and crosses 50% somewhere in year 2 for most realistic corpora.
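The formula runs directly as Python. A sketch with an illustrative rate — the $0.10/1M figure below is a placeholder, not any vendor's quoted price:

```python
def twelve_month_cost(total_chunks: int, tokens_per_chunk: int,
                      cost_per_million: float, refresh_rate_per_month: float,
                      months: int = 12) -> tuple[float, float]:
    """Return (initial_ingest_cost, refresh_cost) in dollars.
    `cost_per_million` is the per-1M-token rate."""
    initial_tokens = total_chunks * tokens_per_chunk
    delta_tokens = total_chunks * refresh_rate_per_month * tokens_per_chunk
    ingest = initial_tokens / 1e6 * cost_per_million
    refresh = delta_tokens / 1e6 * cost_per_million * months
    return ingest, refresh

# Worked example from the text: 1M chunks, 500 tokens/chunk, 5% monthly
# refresh — 500M ingest tokens plus 25M refresh tokens/month × 12 = 300M,
# so refresh is already 37.5% of year-1 token volume at this rate.
ingest, refresh = twelve_month_cost(1_000_000, 500, 0.10, 0.05)
```

Sweeping `refresh_rate_per_month` from 0.02 to 0.10 against your own corpus size is the fastest way to see whether your bill is ingest-shaped or refresh-shaped.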
Token volume by year · ingest vs refresh on a typical RAG corpus
Source: Digital Applied modelling · 1M-chunk corpus · 500 tokens / chunk · 5% monthly refresh

The lesson is simple and chronically underweighted: refresh dominates total cost on any corpus that lives longer than a year. A vendor that looked $200 cheaper on initial ingest can cost $600 more over three years if the per-token rate for steady-state refresh is structurally higher. Compare on the full horizon, not the launch month.
Two operational levers cut refresh volume materially. First, checksum-gated re-embedding: hash each chunk on ingest, compare against the stored hash on refresh, and only re-embed when the content actually changed. On a typical mixed content corpus (some documents drift, most don't), this often cuts refresh volume by 60–80% versus naive re-embed-all. Second, scheduled batching: queue refresh deltas across a week and submit them as a single batch job at the discounted rate, rather than firing realtime embedding calls on every content change.
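The checksum gate is small enough to sketch. A minimal version using content hashes, where `stored_hashes` stands in for whatever hash column your pipeline persists alongside the vectors:

```python
import hashlib

def chunks_to_reembed(chunks: dict[str, str],
                      stored_hashes: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (changed_ids, new_hashes). Only chunks whose content hash
    differs from the stored hash need re-embedding on this refresh pass;
    new or edited chunks show up as changed, untouched chunks are skipped."""
    changed, new_hashes = [], {}
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[chunk_id] = digest
        if stored_hashes.get(chunk_id) != digest:
            changed.append(chunk_id)
    return changed, new_hashes
```

Persist `new_hashes` after the batch job succeeds, not before — otherwise a failed embedding run leaves the hash table claiming work that never happened.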
Initial ingest — 500M tokens · 1M-chunk corpus · One-time
One-time cost the first time the corpus is embedded. Almost always run as a batch job at the discounted rate. Dominates year-1 cost on a freshly launched RAG pipeline.

Monthly refresh — 25M tokens · 5% refresh rate · Recurring
Recurring cost as documents update, new pages publish, and source content drifts. Refresh rate varies materially by corpus type — product docs drift slowly, news corpora drift fast.

Query embedding — variable per traffic · Realtime
Live query token volume per month scales with user traffic, not corpus size. Modest in absolute terms for most B2B RAG (~1–10M tokens/month), but always realtime-priced so the per-token rate matters.

05 — Quality Pareto
MTEB and BEIR — where each vendor wins.
Headline benchmark scores tell only the broadest version of the quality story. The two public leaderboards worth tracking are MTEB (Massive Text Embedding Benchmark) for breadth across 50+ retrieval, clustering, and classification tasks, and BEIR for depth on retrieval specifically. Both shift as new models ship; both still tell you which vendor leads at a given moment in time.
The chart below is an illustrative reading of where each frontier vendor sits on the MTEB English retrieval subset and the BEIR average as of May 2026. Numbers shift; the cluster-by-cluster shape is what matters for shortlist construction.
Quality positioning · MTEB English retrieval × BEIR average
Illustrative as of May 2026 · re-verify on the MTEB and BEIR leaderboards before committing

The cluster shape worth seeing: three vendors sit inside a narrow band at the top of the English-generalist benchmark (OpenAI, Voyage, Cohere), with Jina and Mistral a few points behind. Domain-specific variants from Voyage close or invert the gap on technical corpora — code, finance, law — where the domain model was actually trained to win.
Two practical takeaways. First, the top-of-leaderboard gap is narrow in 2026 — typically inside 5 points on MTEB English retrieval. If the per-token cost difference between vendors is 3–5× and the quality difference is <5 points, cost-led selection is defensible for most corpora. Second, the benchmark gap on your corpus may differ materially from the public leaderboard. Always run a small held-out retrieval eval on your own data before locking in a vendor for a multi-year roadmap.
06 — Break-Even
Corpus size where each vendor wins on cost.
With the cost formula in §04 and the quality cluster in §05, the break-even analysis falls out naturally. Each vendor wins on TCO somewhere on the corpus-size × refresh-rate plane; the structural break-evens are stable enough to plan against even as exact rates shift.
The matrix below maps corpus-size bands to the vendor that most often wins year-3 TCO at typical refresh rates, assuming production-tier pricing and ~5% monthly refresh on a mixed-content English-language corpus.
Pricing is a rounding error · Pick OpenAI
Initial ingest is under $20 at any vendor; year-3 TCO is dominated by query embedding, which is modest at this scale. Pick on quality, integration, and the rest of the stack. OpenAI is the lowest-regret default.

Mid-band — vendor choice matters but is reversible · Mistral / Cohere
Year-3 TCO spreads start to bite — typically a 2–3× cost gap between the cheapest and most expensive vendor on a typical refresh profile. Cost-led selection becomes defensible if quality on your eval set is comparable. Mistral or Cohere often win.

Cost dominates — pick the cheapest acceptable quality · Mistral or Cohere
Initial ingest crosses $1K, year-3 TCO crosses $5K. Cost-led selection is the right call unless a measured quality reason rules out the cheapest vendor. Mistral leads on list price; Cohere wins if multilingual quality matters.

Self-host or hybrid · Self-host Jina
At this scale the hosted-API token bill compounds materially. Open-weight Jina v4 self-hosted on your own hardware becomes structurally attractive, particularly when sovereignty or compliance argues for retiring the API dependency anyway.

Quality compounds downstream — pay the premium · Voyage domain variant
Code, finance, law, biomedical — domains where a 2-point retrieval-recall lift drives a meaningful answer-quality lift downstream. Voyage's domain models are usually worth the premium per-token price; measure the recall lift on your corpus first.

The break-evens above assume the refresh-cycle math from §04 is honest. If your corpus refreshes faster than the assumed 5% monthly — news, social, telemetry, anything with a fresh-content imperative — the break-even points shift earlier and cost-led selection becomes correct at smaller corpus sizes. If your corpus is static (regulatory text, technical standards, archived content) the break-evens shift later and quality-led selection holds longer.
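One way to see the shift is to put the §04 formula on a 36-month horizon and compare two hypothetical per-token rates — the $0.13/1M and $0.04/1M figures below are placeholders for illustration, not quoted vendor prices:

```python
def three_year_tco(total_chunks: int, tokens_per_chunk: int,
                   rate_per_million: float, refresh_rate: float = 0.05,
                   months: int = 36) -> float:
    """Ingest plus refresh cost in dollars over a 36-month horizon."""
    ingest_tokens_m = total_chunks * tokens_per_chunk / 1e6
    return ingest_tokens_m * rate_per_million * (1 + refresh_rate * months)

# Hypothetical vendors A ($0.13/1M) and B ($0.04/1M) on the 1M-chunk,
# 500-token corpus. At 5% monthly refresh, 36 months of deltas multiply
# the ingest bill by (1 + 0.05 × 36) = 2.8× — so the rate gap scales too.
gap = (three_year_tco(1_000_000, 500, 0.13)
       - three_year_tco(1_000_000, 500, 0.04))
```

Because the model is linear in corpus size and refresh rate, doubling either one doubles the dollar gap between vendors — which is exactly why the break-even points slide earlier on fast-refresh corpora.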
One non-obvious wrinkle. The cost gap between vendors widens disproportionately when you include re-embedding for vendor migration. If you anticipate a need to swap embedding vendors inside the planning horizon — which is more common than teams admit — model the re-embedding cost explicitly and lean toward vendors with longer track records of pricing and model stability.
07 — Recommendations
Five archetypes, five picks.
The matrices and the math collapse to five archetypes that cover most production RAG decisions in 2026. Pick the archetype that fits the corpus and the constraints, run a corpus-specific eval before contracting, and commit. Switch costs are real — minimise them by picking well the first time.
OpenAI text-embedding-3-large · Pick OpenAI
Lowest-regret default. Strong quality, stable pricing, matryoshka truncation gives you a dimension-reduction lever later. Pair batch ingest with realtime query and you're inside the most defensible cost envelope at the most defensible quality.

Voyage-3 with the matching domain variant · Pick Voyage domain
If your corpus is jargon-heavy and answer quality compounds downstream — wrong code retrieval breaks a fix, wrong precedent retrieval breaks a brief — pay the premium per-token rate and recover it many times over in downstream accuracy. Measure the lift on a held-out set first.

Cohere Embed v4 · Pick Cohere
Coverage across 100+ languages with retrieval quality that holds up materially better than translate-then-embed. The right answer whenever the corpus or the query distribution crosses language boundaries — translate-then-embed is the wrong shape of solution at production scale.

Mistral Embed at scale · Pick Mistral
When the corpus is large, the budget is tight, and the content is generalist English, Mistral's per-token rate is the structural advantage. Quality holds; the cost gap to the leaderboard top is modest on most benchmarks. Run your own eval, then commit.

Jina v4 self-hosted (with hosted fallback) · Self-host Jina
Open weights plus a credible hosted API gives you the structural option to retire the API dependency over time as the corpus and your hardware investment grow. The right answer for healthcare, financial services, and public sector workloads where corpus content cannot leave the trust boundary.

Two cross-cutting recommendations. First, measure recall before you spend: build a 50–100-query held-out eval set drawn from your actual product queries, score each candidate vendor against it, and only then move to the cost math. Cost-led selection on a vendor whose quality drops 8 points on your corpus is an expensive mistake. Second, plan the re-embedding budget into the year-2 line: vendor migrations happen, model upgrades happen, and dimension changes happen — re-embedding 500M tokens is a real bill that lives outside the steady-state refresh math.
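The vendor-scoring loop itself is small. A sketch where `embed_fn` is a placeholder you wire to each candidate vendor's API client, assuming one gold document per query and unit-normalized output vectors:

```python
import numpy as np

def score_vendor(embed_fn, queries: list[str], docs: list[str],
                 gold: list[int], k: int = 10) -> float:
    """Embed the held-out eval set with one vendor's model and return
    recall@k. `embed_fn` maps a list of strings to an (n, d) array of
    unit vectors — swap in each candidate API client in turn."""
    q = np.asarray(embed_fn(queries))
    d = np.asarray(embed_fn(docs))
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))
```

Run the same 50–100-query set through every candidate, and only move to the cost math on the vendors that clear your recall bar.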
If your team is shipping its first production RAG pipeline and the corpus is under 1M chunks, the OpenAI default plus batch ingest is the right starting point — almost every time. Our AI digital transformation engagements often start with exactly this kind of comparative eval, and our self-hosted RAG tutorial walks the production pipeline end-to-end with these vendor choices in mind.
For teams that have already shipped a RAG pipeline and suspect retrieval quality is the bottleneck rather than generation quality, the eight-point quality scorecard in our RAG system audit guide is the right next read — it covers the diagnostic side of the same decision.
Embedding choice compounds — pick by the refresh-cycle math, not the headline price.
Five frontier vendors ship general-purpose embedding APIs today, and the per-1M-token list prices spread across a 3–5× range. That spread is real money for any corpus past 100K chunks — but the headline price is the wrong axis to decide on. Refresh cadence, embedding dimensionality, batch versus realtime split, and the cost of an eventual vendor migration all sit upstream of the per-token rate in a year-3 TCO calculation.
The right discipline is to compare on the full horizon. Build the refresh-cycle formula from §04 against your actual corpus size, refresh rate, and token count. Run a small held-out eval against each candidate vendor on your own content. Pick the cheapest acceptable quality, plan for one re-embedding event inside the horizon, and commit. The decision compounds either way — pick well and refresh runs cheap for years; pick poorly and you spend the difference back every month indefinitely.
The broader signal is that embedding is commoditising on price faster than it is differentiating on quality. The top-three leaderboard gap is narrow, the price gap is wide, and matryoshka truncation gives every team a structural dimension-cost lever to pull as the corpus grows. The decision is no longer about which vendor is the smartest — it is about which vendor's pricing curve, model stability, and migration story best fit the next three years of the RAG pipeline you actually plan to run.