Embedding model cost is the most miscounted line item in a production RAG budget. The per-1M-token headline price hides the two numbers that actually decide the bill — refresh cadence and embedding dimensionality — and the wrong vendor choice compounds month after month for as long as the corpus keeps changing. This calculator reframes the decision around the math that matters.
What's at stake: a 1M-chunk corpus at 500 tokens per chunk is 500M tokens of initial ingest, and a 5% monthly refresh rate adds another 25M tokens every month indefinitely. Pick the vendor that looked $20 cheaper on a one-off ingest and you can spend the difference back inside a quarter once the refresh cycle starts running. The decision is structural, not price-list.
This guide compares five frontier embedding providers — OpenAI, Cohere, Voyage, Mistral, and Jina — across per-token pricing, dimensionality tradeoffs, MTEB and BEIR quality scores, refresh-cycle math over a 12-month horizon, and break-even corpus sizes where each vendor wins. Five archetypes at the end. Numbers sourced from each vendor's public pricing pages and benchmark reports as of May 4, 2026.
- 01 · OpenAI text-embedding-3-large is the strong default. Quality plus price stability plus matryoshka truncation make it the lowest-regret choice for most teams shipping their first or second production RAG pipeline. Start here unless you have a measured reason to leave.
- 02 · Voyage-3 is the quality leader on technical corpora. Domain-specific variants (code, finance, law) reportedly outperform generalist competitors on retrieval recall by meaningful margins. Worth the premium when your corpus is jargon-heavy and accuracy compounds downstream.
- 03 · Cohere Embed v4 is the multilingual winner. Coverage across 100+ languages, with retrieval quality holding up far better than translate-then-embed alternatives. Use when the corpus or the query distribution crosses language boundaries.
- 04 · Mistral Embed is the cost leader at high volume. Lowest per-1M-token list price among the frontier providers, with quality that holds up on generalist English-language corpora. Use when token budget rather than headline accuracy caps the design.
- 05 · Refresh-cycle math reframes the decision. Year-1 cost is dominated by initial ingest; year-3 cost is dominated by refresh. A vendor that looks expensive at ingest can win on three-year TCO once delta-token volume compounds. Compare on the full horizon.
01 — Five Vendors
OpenAI, Cohere, Voyage, Mistral, Jina — same task, different curves.
Five providers ship frontier-tier general-purpose embedding models into the market today. They share the same job to be done — turn a chunk of text into a dense vector that supports semantic similarity search — but they price, dimension, and benchmark differently enough that the right pick depends on the corpus, not the headline.
Below is the snapshot. Each card lists the flagship model, its native dimension, its standout positioning, and the workload class where the vendor most often earns a place on a shortlist. Read this as the map before reading the math.
text-embedding-3-large (OpenAI) · Default · Matryoshka
Matryoshka-trained — truncate to 256, 512, 1024, or 1536 dimensions with graceful quality decay. Strong on MTEB English. The lowest-regret default for most production RAG pipelines today.

Embed v4 (Cohere) · Multilingual
100+ language coverage with retrieval quality that holds up materially better than translate-then-embed approaches. Compressed embedding variants for cost-sensitive deployments. The multilingual specialist.

voyage-3 + domain variants (Voyage) · Quality leader
Domain-specific models for code, finance, and law that reportedly outperform generalist competitors on retrieval recall in their respective sectors. Premium pricing, premium accuracy on jargon-heavy corpora.

mistral-embed (Mistral) · Cost leader
Lowest per-1M-token list price among the frontier providers as of May 2026, with quality that holds up on generalist English-language corpora. The volume play when the corpus is large and the budget is tight.

Jina Embeddings v4 (Jina) · Open weights
Open-source weights plus a hosted API. The self-host story is the strongest of the five — useful for sovereignty-bound deployments and teams that want to retire the API dependency entirely over time.

The five-vendor map is stable enough to plan against for the rest of 2026, but pricing pages change quarterly and benchmark leaderboards shift as new models ship. Treat the cards above as the directional positioning; treat the numbers in §02–§04 as valid-as-of-May-2026 and re-verify before committing to a multi-year roadmap.
02 — Per-Token Pricing
Realtime vs batch pricing across providers.
Per-1M-token list prices are the easy number to grab off a pricing page; what they hide is the realtime / batch split. Most embedding API providers now offer a discounted batch tier for asynchronous ingest workloads — useful precisely because initial corpus ingest and large monthly refreshes can run as batch jobs. Live query embedding still runs at realtime rates.
The matrix below is the operating choice for each vendor — when to pick realtime, when to pick batch, what the practical pricing tier delivers, and the workload class it fits. Numbers are illustrative-tier comparisons; verify exact rates on each provider's current pricing page before contracting.
OpenAI realtime — text-embedding-3-large at list rate · Pick for query-side
The right tier for query-side embedding, where TTFB matters more than per-token savings. Pair with batch ingest below to split the workload sensibly. Reliable, stable pricing, the lowest-regret default for live retrieval.

OpenAI batch — async ingest tier, 50% off list · Pick for ingest
Use for the initial corpus ingest and any monthly refresh runs that don't need sub-second turnaround. Submit as a batch file, retrieve results within the published SLA window. Effectively halves the dominant cost line for most production RAG budgets.

Cohere — realtime with compressed variants · Pick for multilingual
Compressed embedding variants drop the per-vector storage and bandwidth cost while keeping retrieval quality close to the full-dimension model. Useful when storage is a more binding constraint than per-call latency. Multilingual workloads win here.

Voyage — premium realtime, no public batch tier · Pick for domain RAG
Higher per-token price than the OpenAI / Mistral pair, justified by domain-specific quality on technical corpora. No published batch discount as of May 2026 — model the full ingest cost at list rate when comparing TCO. Premium positioning, premium math.

Mistral — volume tier, lowest per-token list price · Pick for volume
The cost leader at the top of the table when the corpus is large and the budget is tight. Generalist English-language quality holds up; coverage outside generalist English is narrower than Cohere or OpenAI. Verify on your corpus before committing.

Jina — hosted API + self-host fallback · Pick for sovereignty
The hosted API competes on price with Cohere and OpenAI; the open-weight self-host option is the structural advantage. Useful when sovereignty or long-horizon cost argues for the option to retire the API dependency entirely over time.

The most common deployment we see in 2026 splits the workload: a batch tier for initial ingest and monthly refresh, a realtime tier from the same vendor for query embedding. Splitting ingest and query across vendors doesn't work at all — cross-vendor vectors are not comparable, so query embeddings must come from the same model that embedded the corpus, and any ingest-vendor change means re-embedding the entire corpus.

One quiet gotcha worth flagging: free-tier or low-volume promotional pricing rarely survives the move to production traffic. Always model the cost calculation against the volume-tier rate you'll actually pay, not the introductory rate the pricing page leads with. The numbers in §04 use production-tier rates throughout.
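The split-tier monthly bill is easy to model. A minimal sketch, assuming a flat 50% batch discount (the actual discount and tier structure vary by vendor — verify on the pricing page); rates are dollars per 1M tokens and volumes are in millions of tokens:

```python
def monthly_cost(refresh_tokens_m: float, query_tokens_m: float,
                 realtime_rate: float, batch_discount: float = 0.5) -> float:
    """Monthly spend when refresh runs on the batch tier and live query
    embedding runs at the realtime rate. Token volumes in millions."""
    batch_rate = realtime_rate * (1 - batch_discount)
    return refresh_tokens_m * batch_rate + query_tokens_m * realtime_rate

# 25M refresh tokens/month batched plus 5M query tokens/month realtime,
# at an illustrative $0.13/1M realtime rate (not a quoted vendor price).
bill = monthly_cost(25, 5, 0.13)   # 25 × 0.065 + 5 × 0.13
```

Even though the query volume here is a fifth of the refresh volume, it contributes over a quarter of the bill — which is why the realtime rate still matters after the batch split.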
"Per-token list price is the easy number. Refresh-cycle math is the real bill."
— Digital Applied engineering team, post-mortem on a 14M-chunk migration
03 — Dimensionality
Quality vs storage cost — the tradeoff.
Embedding dimension is the second-most-miscounted variable in a production RAG budget. The default for many providers ships at 1024 or 1536 dimensions; OpenAI's flagship goes to 3072. The cost of storing and searching those vectors scales linearly with dimension count, and the cost of returning a vector to a client (network bandwidth, JSON payload size) scales linearly too.
What changed in 2024–2025 is the broad adoption of matryoshka representation learning — models trained so that the first k dimensions of the embedding remain useful on their own. A matryoshka-trained 3072-dim embedding can be truncated to 1536, 1024, 512, or 256 dimensions with quality decay that is much gentler than naive truncation of a non-matryoshka model. OpenAI's text-embedding-3 family is the canonical example.
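The truncate-and-renormalize step itself is a couple of lines; a minimal NumPy sketch, assuming the input vector is already unit-normalized:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of a matryoshka-trained embedding
    and re-normalize to unit length so cosine similarity stays meaningful."""
    truncated = embedding[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: truncate a synthetic 3072-dim unit vector down to 1024 dims.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)
small = truncate_matryoshka(full, 1024)   # shape (1024,), unit length
```

OpenAI's embeddings API also exposes a `dimensions` request parameter that performs the equivalent truncation server-side for the text-embedding-3 family, which saves the bandwidth of shipping the full vector.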
3072 dimensions · Quality ceiling
OpenAI text-embedding-3-large default. Maximum retrieval quality the model can deliver. Storage cost in pgvector or any dense store is roughly 12 KB per vector at fp32, 6 KB at fp16. Use when accuracy on hard retrieval queries is the binding constraint.

1024–1536 dimensions · Production default
Matryoshka-truncated default for most production deployments. Quality decay reportedly under 1% on MTEB English versus full 3072. Storage roughly halves. Index build time and query latency improve in line with the dimension drop. The sweet spot for ~80% of production RAG today.

256–512 dimensions · Edge / cost-bound
Aggressive truncation for cost-bound or edge deployments. Quality decay becomes visible on hard retrieval queries — name disambiguation, jargon, near-duplicate detection. Storage drops to roughly 1–2 KB per vector. Worth measuring on your own queries before deploying; not a universal win.

The decision framework is straightforward in concept and harder in practice: measure recall on a held-out retrieval set at full dimension, then re-measure at 1536, 1024, 512, and 256. Pick the smallest dimension where recall@10 drops by less than a threshold you set — most teams settle on 1–2 percentage points of MTEB-equivalent loss as the acceptable margin.
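That sweep is a few lines of NumPy once the held-out set is embedded at full dimension. A sketch, assuming unit-normalized embeddings and one gold document per query:

```python
import numpy as np

def truncate(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` dims, re-normalize."""
    t = vecs[:, :dim]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

def recall_at_k(queries: np.ndarray, docs: np.ndarray,
                gold: list[int], k: int = 10) -> float:
    """Fraction of queries whose gold document lands in the top-k by cosine."""
    topk = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

def dimension_sweep(queries, docs, gold, dims=(3072, 1536, 1024, 512, 256)):
    """Recall@10 at each candidate dimension: pick the smallest dim whose
    drop from full dimension stays inside your chosen threshold."""
    return {d: recall_at_k(truncate(queries, d), truncate(docs, d), gold)
            for d in dims}
```

The brute-force similarity matrix is fine at eval-set scale (50–100 queries against a few thousand candidate chunks); there is no need to stand up an ANN index just to run the sweep.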
For non-matryoshka models (some Cohere and Voyage variants), the tradeoff is sharper — truncating a non-matryoshka vector tends to degrade quality much faster, so the practical move is to pick the dimension at training time and accept that as the operating point. Check each vendor's docs for whether matryoshka-style truncation is supported before betting the architecture on it.
04 — Refresh Math
Initial ingest + 12 months of refresh deltas.
The refresh-cycle formula is the calculator that this whole article exists to put in front of the reader:
12_month_cost
  = initial_ingest_cost
  + (delta_chunks × tokens_per_chunk × cost_per_million_tokens / 1,000,000 × 12)

# where:
# initial_ingest_cost    = total_chunks × tokens_per_chunk × cost_per_million_tokens / 1,000,000
# delta_chunks           = total_chunks × refresh_rate_per_month
# refresh_rate_per_month = fraction of corpus that changes monthly (typically 2–10%)
# tokens_per_chunk       = average chunk length (typically 300–800)

Worked example. A 1M-chunk corpus at 500 tokens per chunk is 500M tokens of initial ingest. At a 5% monthly refresh rate, the delta is 50K chunks per month, or 25M tokens of recurring ingest. Over 12 months that is 300M tokens of refresh on top of the 500M-token initial ingest — refresh becomes roughly 38% of year-1 token volume, and crosses 50% somewhere in year 2 for most realistic corpora.
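The formula runs directly as Python. A sketch with an illustrative rate — the $0.10/1M figure below is a placeholder, not any vendor's quoted price:

```python
def twelve_month_cost(total_chunks: int, tokens_per_chunk: int,
                      cost_per_million: float, refresh_rate_per_month: float,
                      months: int = 12) -> tuple[float, float]:
    """Return (initial_ingest_cost, refresh_cost) in dollars.
    `cost_per_million` is the per-1M-token rate."""
    initial_tokens = total_chunks * tokens_per_chunk
    delta_tokens = total_chunks * refresh_rate_per_month * tokens_per_chunk
    ingest = initial_tokens / 1e6 * cost_per_million
    refresh = delta_tokens / 1e6 * cost_per_million * months
    return ingest, refresh

# Worked example from the text: 1M chunks, 500 tokens/chunk, 5% monthly
# refresh — 500M ingest tokens plus 25M refresh tokens/month × 12 = 300M,
# so refresh is already 37.5% of year-1 token volume at this rate.
ingest, refresh = twelve_month_cost(1_000_000, 500, 0.10, 0.05)
```

Sweeping `refresh_rate_per_month` from 0.02 to 0.10 against your own corpus size is the fastest way to see whether your bill is ingest-shaped or refresh-shaped.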
Token volume by year · ingest vs refresh on a typical RAG corpus
Source: Digital Applied modelling · 1M-chunk corpus · 500 tokens / chunk · 5% monthly refresh

The lesson is simple and chronically underweighted: refresh dominates total cost on any corpus that lives longer than a year. A vendor that looked $200 cheaper on initial ingest can cost $600 more over three years if the per-token rate for steady-state refresh is structurally higher. Compare on the full horizon, not the launch month.
Two operational levers cut refresh volume materially. First, checksum-gated re-embedding: hash each chunk on ingest, compare against the stored hash on refresh, and only re-embed when the content actually changed. On a typical mixed content corpus (some documents drift, most don't), this often cuts refresh volume by 60–80% versus naive re-embed-all. Second, scheduled batching: queue refresh deltas across a week and submit them as a single batch job at the discounted rate, rather than firing realtime embedding calls on every content change.
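The checksum gate is small enough to sketch. A minimal version using content hashes, where `stored_hashes` stands in for whatever hash column your pipeline persists alongside the vectors:

```python
import hashlib

def chunks_to_reembed(chunks: dict[str, str],
                      stored_hashes: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (changed_ids, new_hashes). Only chunks whose content hash
    differs from the stored hash need re-embedding on this refresh pass;
    new or edited chunks show up as changed, untouched chunks are skipped."""
    changed, new_hashes = [], {}
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[chunk_id] = digest
        if stored_hashes.get(chunk_id) != digest:
            changed.append(chunk_id)
    return changed, new_hashes
```

Persist `new_hashes` after the batch job succeeds, not before — otherwise a failed embedding run leaves the hash table claiming work that never happened.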
Initial ingest — 500M tokens · 1M-chunk corpus · One-time
One-time cost the first time the corpus is embedded. Almost always run as a batch job at the discounted rate. Dominates year-1 cost on a freshly launched RAG pipeline.

Monthly refresh — 25M tokens · 5% refresh rate · Recurring
Recurring cost as documents update, new pages publish, and source content drifts. Refresh rate varies materially by corpus type — product docs drift slowly, news corpora drift fast.

Query embedding — variable per traffic · Realtime
Live query token volume per month scales with user traffic, not corpus size. Modest in absolute terms for most B2B RAG (~1–10M tokens/month), but always realtime-priced so the per-token rate matters.

05 — Quality Pareto
MTEB and BEIR — where each vendor wins.
Headline benchmark scores tell only the broadest version of the quality story. The two public leaderboards worth tracking are MTEB (Massive Text Embedding Benchmark) for breadth across 50+ retrieval, clustering, and classification tasks, and BEIR for depth on retrieval specifically. Both shift as new models ship; both still tell you which vendor leads at a given moment in time.
The chart below is an illustrative reading of where each frontier vendor sits on the MTEB English retrieval subset and the BEIR average as of May 2026. Numbers shift; the cluster-by-cluster shape is what matters for shortlist construction.
Quality positioning · MTEB English retrieval × BEIR average
Illustrative as of May 2026 · re-verify on the MTEB and BEIR leaderboards before committing

The cluster shape worth seeing: three vendors sit inside a narrow band at the top of the English-generalist benchmark (OpenAI, Voyage, Cohere), with Jina and Mistral a few points behind. Domain-specific variants from Voyage close or invert the gap on technical corpora — code, finance, law — where the domain model was actually trained to win.
Two practical takeaways. First, the top-of-leaderboard gap is narrow in 2026 — typically inside 5 points on MTEB English retrieval. If the per-token cost difference between vendors is 3–5× and the quality difference is <5 points, cost-led selection is defensible for most corpora. Second, the benchmark gap on your corpus may differ materially from the public leaderboard. Always run a small held-out retrieval eval on your own data before locking in a vendor for a multi-year roadmap.
06 — Break-Even
Corpus size where each vendor wins on cost.
With the cost formula in §04 and the quality cluster in §05, the break-even analysis falls out naturally. Each vendor wins on TCO somewhere on the corpus-size × refresh-rate plane; the structural break-evens are stable enough to plan against even as exact rates shift.
The matrix below maps corpus-size bands to the vendor that most often wins year-3 TCO at typical refresh rates, assuming production-tier pricing and ~5% monthly refresh on a mixed-content English-language corpus.
Pricing is a rounding error · Pick OpenAI
Initial ingest is under $20 at any vendor; year-3 TCO is dominated by query embedding, which is modest at this scale. Pick on quality, integration, and the rest of the stack. OpenAI is the lowest-regret default.

Mid-band — vendor choice matters but is reversible · Mistral / Cohere
Year-3 TCO spreads start to bite — typically a 2–3× cost gap between the cheapest and most expensive vendor on a typical refresh profile. Cost-led selection becomes defensible if quality on your eval set is comparable. Mistral or Cohere often win.

Cost dominates — pick the cheapest acceptable quality · Mistral or Cohere
Initial ingest crosses $1K, year-3 TCO crosses $5K. Cost-led selection is the right call unless a measured quality reason rules out the cheapest vendor. Mistral leads on list price; Cohere wins if multilingual quality matters.

Self-host or hybrid · Self-host Jina
At this scale the hosted-API token bill compounds materially. Open-weight Jina v4 self-hosted on your own hardware becomes structurally attractive, particularly when sovereignty or compliance argues for retiring the API dependency anyway.

Quality compounds downstream — pay the premium · Voyage domain variant
Code, finance, law, biomedical — domains where a 2-point retrieval-recall lift drives a meaningful answer-quality lift downstream. Voyage's domain models are usually worth the premium per-token price; measure the recall lift on your corpus first.

The break-evens above assume the refresh-cycle math from §04 is honest. If your corpus refreshes faster than the assumed 5% monthly — news, social, telemetry, anything with a fresh-content imperative — the break-even points shift earlier and cost-led selection becomes correct at smaller corpus sizes. If your corpus is static (regulatory text, technical standards, archived content) the break-evens shift later and quality-led selection holds longer.
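One way to see the shift is to put the §04 formula on a 36-month horizon and compare two hypothetical per-token rates — the $0.13/1M and $0.04/1M figures below are placeholders for illustration, not quoted vendor prices:

```python
def three_year_tco(total_chunks: int, tokens_per_chunk: int,
                   rate_per_million: float, refresh_rate: float = 0.05,
                   months: int = 36) -> float:
    """Ingest plus refresh cost in dollars over a 36-month horizon."""
    ingest_tokens_m = total_chunks * tokens_per_chunk / 1e6
    return ingest_tokens_m * rate_per_million * (1 + refresh_rate * months)

# Hypothetical vendors A ($0.13/1M) and B ($0.04/1M) on the 1M-chunk,
# 500-token corpus. At 5% monthly refresh, 36 months of deltas multiply
# the ingest bill by (1 + 0.05 × 36) = 2.8× — so the rate gap scales too.
gap = (three_year_tco(1_000_000, 500, 0.13)
       - three_year_tco(1_000_000, 500, 0.04))
```

Because the model is linear in corpus size and refresh rate, doubling either one doubles the dollar gap between vendors — which is exactly why the break-even points slide earlier on fast-refresh corpora.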
One non-obvious wrinkle. The cost gap between vendors widens disproportionately when you include re-embedding for vendor migration. If you anticipate a need to swap embedding vendors inside the planning horizon — which is more common than teams admit — model the re-embedding cost explicitly and lean toward vendors with longer track records of pricing and model stability.
07 — Recommendations
Five archetypes, five picks.
The matrices and the math collapse to five archetypes that cover most production RAG decisions in 2026. Pick the archetype that fits the corpus and the constraints, run a corpus-specific eval before contracting, and commit. Switch costs are real — minimise them by picking well the first time.
OpenAI text-embedding-3-large · Pick OpenAI
Lowest-regret default. Strong quality, stable pricing, matryoshka truncation gives you a dimension-reduction lever later. Pair batch ingest with realtime query and you're inside the most defensible cost envelope at the most defensible quality.

Voyage-3 with the matching domain variant · Pick Voyage domain
If your corpus is jargon-heavy and answer quality compounds downstream — wrong code retrieval breaks a fix, wrong precedent retrieval breaks a brief — pay the premium per-token rate and recover it many times over in downstream accuracy. Measure the lift on a held-out set first.

Cohere Embed v4 · Pick Cohere
Coverage across 100+ languages with retrieval quality that holds up materially better than translate-then-embed. The right answer whenever the corpus or the query distribution crosses language boundaries — translate-then-embed is the wrong shape of solution at production scale.

Mistral Embed at scale · Pick Mistral
When the corpus is large, the budget is tight, and the content is generalist English, Mistral's per-token rate is the structural advantage. Quality holds; the cost gap to the leaderboard top is modest on most benchmarks. Run your own eval, then commit.

Jina v4 self-hosted (with hosted fallback) · Self-host Jina
Open weights plus a credible hosted API gives you the structural option to retire the API dependency over time as the corpus and your hardware investment grow. The right answer for healthcare, financial services, and public sector workloads where corpus content cannot leave the trust boundary.

Two cross-cutting recommendations. First, measure recall before you spend: build a 50–100-query held-out eval set drawn from your actual product queries, score each candidate vendor against it, and only then move to the cost math. Cost-led selection on a vendor whose quality drops 8 points on your corpus is an expensive mistake. Second, plan the re-embedding budget into the year-2 line: vendor migrations happen, model upgrades happen, and dimension changes happen — re-embedding 500M tokens is a real bill that lives outside the steady-state refresh math.
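The vendor-scoring loop itself is small. A sketch where `embed_fn` is a placeholder you wire to each candidate vendor's API client, assuming one gold document per query and unit-normalized output vectors:

```python
import numpy as np

def score_vendor(embed_fn, queries: list[str], docs: list[str],
                 gold: list[int], k: int = 10) -> float:
    """Embed the held-out eval set with one vendor's model and return
    recall@k. `embed_fn` maps a list of strings to an (n, d) array of
    unit vectors — swap in each candidate API client in turn."""
    q = np.asarray(embed_fn(queries))
    d = np.asarray(embed_fn(docs))
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))
```

Run the same 50–100-query set through every candidate, and only move to the cost math on the vendors that clear your recall bar.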
If your team is shipping its first production RAG pipeline and the corpus is under 1M chunks, the OpenAI default plus batch ingest is the right starting point — almost every time. Our AI digital transformation engagements often start with exactly this kind of comparative eval, and our self-hosted RAG tutorial walks the production pipeline end-to-end with these vendor choices in mind.
For teams that have already shipped a RAG pipeline and suspect retrieval quality is the bottleneck rather than generation quality, the eight-point quality scorecard in our RAG system audit guide is the right next read — it covers the diagnostic side of the same decision.
Embedding choice compounds — pick by the refresh-cycle math, not the headline price.
Five frontier vendors ship general-purpose embedding APIs today, and the per-1M-token list prices spread across a 3–5× range. That spread is real money for any corpus past 100K chunks — but the headline price is the wrong axis to decide on. Refresh cadence, embedding dimensionality, batch versus realtime split, and the cost of an eventual vendor migration all sit upstream of the per-token rate in a year-3 TCO calculation.
The right discipline is to compare on the full horizon. Build the refresh-cycle formula from §04 against your actual corpus size, refresh rate, and token count. Run a small held-out eval against each candidate vendor on your own content. Pick the cheapest acceptable quality, plan for one re-embedding event inside the horizon, and commit. The decision compounds either way — pick well and refresh runs cheap for years; pick poorly and you spend the difference back every month indefinitely.
The broader signal is that embedding is commoditising on price faster than it is differentiating on quality. The top-three leaderboard gap is narrow, the price gap is wide, and matryoshka truncation gives every team a structural dimension-cost lever to pull as the corpus grows. The decision is no longer about which vendor is the smartest — it is about which vendor's pricing curve, model stability, and migration story best fit the next three years of the RAG pipeline you actually plan to run.