AI Development · New Release · 10 min read · Published June 25, 2026

Structured document extraction · 170 languages · $2 per 1,000 pages in batch

Mistral OCR 4: Document AI for Business Automation

Mistral OCR 4 launched June 23, 2026 and returns a structured representation of a document — bounding boxes, block types, and confidence scores — not just clean text. At $4 per 1,000 pages ($2 in batch), it reframes document automation as a cost question. The benchmark claims are vendor-stated and worth reading with care; the pricing and the self-hosting option are the durable story.

Digital Applied Team
Senior AI engineers · Published Jun 25, 2026
Sources: Mistral + VentureBeat + 6 more
  • Batch API price: $2 per 1,000 pages (standard API $4 · Document AI $5)
  • Languages: 170, across 10 language groups
  • OlmOCRBench: 85.20 (vendor-stated · roughly 3rd on the public board)
  • Vs Azure Custom: 15× lower per page at batch pricing

Mistral OCR 4 is a document-AI model, released June 23, 2026, that returns a structured representation of any enterprise document — bounding boxes, block types, and confidence scores — rather than the flat wall of text earlier OCR generations produced. For teams building document-automation pipelines, that shift is the headline: extraction stops being a parsing problem and becomes a structured-data problem.

It is Mistral’s fourth OCR generation in roughly 15 months, and it lands in a crowded field — Google Document AI, Amazon Textract, Azure Document Intelligence, ABBYY, and a wave of open-weight models. What sets OCR 4 apart is less any single benchmark number and more the combination of aggressive pricing ($4 per 1,000 pages, $2 in batch), a single-container self-hosting option, and a structured output that removes integration layers teams used to build by hand.

This guide covers what actually shipped, why the structured output changes automation economics, an honest read of the benchmark claims (several of which are vendor-stated), a recomputed cost comparison against the hyperscalers, and how to put OCR 4 to work without overpromising on the numbers.

Key takeaways
  1. Structured representation, not just text. OCR 4 returns bounding boxes, block types (title, table, equation, signature), and page- and word-level confidence scores as first-class model outputs: the raw material for traceable, auto-approving pipelines.
  2. Pricing is the durable advantage. $4 per 1,000 pages standard, $2 in batch, $5 for schema-driven Document AI. At batch pricing it undercuts Azure Document Intelligence's custom tier by 15x (7.5x at the standard rate).
  3. Read the benchmarks with care. The 85.20 OlmOCRBench, 93.07 OmniDocBench, and ~72% win-rate figures are vendor-stated. On the public OlmOCRBench leaderboard (last updated May 21, 2026) OCR 4 would rank roughly third, not first.
  4. Self-hosted, but not open-source. A single-container deployment keeps sensitive documents inside your own jurisdiction, which is useful as the August 2, 2026 application date for the EU AI Act's high-risk obligations approaches. Self-hosting a commercial model is not the same as open weights.
  5. Mistral is buying the document layer. Targeting €1B revenue in 2026 (up from ~€200M in 2025) and reportedly in early talks to raise ~€3B at roughly €20B, Mistral is pricing OCR 4 to win the enterprise ingestion layer for RAG and search.

01 · What Shipped
A fourth OCR generation that returns structure, not text.

Mistral describes the leap plainly. Where earlier generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document. In practice that means every extracted element arrives with its location on the page, a type label, and a confidence score — three things downstream automation historically had to reconstruct or guess at.

The model accepts PDF, DOC, PPT, and OpenDocument files directly, so the full spread of back-office documents flows in without a pre-conversion step. Mistral reports coverage across 170 languages in 10 language groups, with gains on rare and low-resource languages where competing systems tend to degrade — a meaningful detail for multinational document pipelines.

Release
OCR generation
4th

Mistral's fourth OCR model in roughly 15 months, released June 23, 2026. Available through the Mistral API and Studio, Amazon SageMaker, and Microsoft Foundry at launch, with Snowflake Parse Document integration announced as coming soon.

June 23, 2026
Inputs
PDF · DOC · PPT · ODF
4 formats

Accepts PDF, DOC, PPT, and OpenDocument files directly — the full spread of back-office documents, with no separate pre-conversion step before extraction.

No pre-conversion
Languages
10 language groups
170

Reported coverage across 170 languages, with gains on rare and low-resource languages where competing OCR systems tend to lose accuracy first.

Low-resource gains
Where to get it
OCR 4 is available at launch via the Mistral API (Mistral Studio), Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document integration announced as coming soon. The pricing page lists OCR 4 as a Premier model and markets it as “the world’s best document extraction and understanding model” — a vendor claim, not an independent verdict.

02 · Structured Output
Why bounding boxes change the pipeline.

Bounding boxes were OCR 4’s most-requested feature, and the reason is structural. Without location data, a downstream RAG or compliance pipeline cannot trace an extracted fact back to the page it came from — the traceability gap that makes audit-ready extraction genuinely hard. With coordinates attached to every element, an extracted number can point back to the exact cell it was read from.

Block classification does similar work one layer up. OCR 4 assigns every element a type — title, table, equation, signature, and others — as a first-class model output rather than a separate post-processing stage. That removes an integration layer enterprise teams used to build in-house just to tell a heading from a table. Confidence scores complete the set: they operate at both page and word level, which is what makes confidence-gated automation possible.
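
To make the three outputs concrete, here is a minimal sketch of how a pipeline might model one extracted element and point a value back to its source region. The field names (`block_type`, `bbox`, `confidence`), the normalized 0..1 coordinate convention, and the `cite` and `to_pixels` helpers are assumptions for illustration; they are not Mistral's published response schema.

```python
from dataclasses import dataclass

# Hypothetical shape for one extracted element; NOT Mistral's actual schema.
@dataclass(frozen=True)
class Block:
    page: int          # 1-based page number
    block_type: str    # e.g. "title", "table", "equation", "signature"
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalized to 0..1
    confidence: float  # assumed to lie in 0..1

def to_pixels(block: Block, page_width: int, page_height: int) -> tuple[int, int, int, int]:
    """Scale an assumed-normalized bounding box to pixel coordinates."""
    x0, y0, x1, y1 = block.bbox
    return (round(x0 * page_width), round(y0 * page_height),
            round(x1 * page_width), round(y1 * page_height))

def cite(block: Block) -> str:
    """Build a human-readable pointer back to the source region."""
    x0, y0, x1, y1 = block.bbox
    return (f"page {block.page}, {block.block_type} at "
            f"({x0:.2f}, {y0:.2f})-({x1:.2f}, {y1:.2f})")

# Made-up example value: an invoice total read from a table cell.
total = Block(page=3, block_type="table", text="1,250.00",
              bbox=(0.62, 0.40, 0.80, 0.44), confidence=0.97)
print(cite(total))
print(to_pixels(total, page_width=1000, page_height=2000))
```

Storing the citation next to the extracted value is what lets a reviewer or auditor jump straight to the region a number came from.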

Location
Bounding boxes
coordinates per element

The most-requested OCR 4 feature. Every extracted element carries its position on the page, so a fact can be traced back to its source region — the prerequisite for audit-ready RAG and compliance workflows.

Source traceability
Type
Block classification
title · table · equation · signature

Each element is typed as a first-class model output, not a bolted-on post-processing step. That removes an integration layer teams previously built by hand to distinguish headings, tables, and signatures.

Removes a layer
Certainty
Confidence scores
page level + word level

Inline confidence at both page and word granularity. Auto-approve high-confidence regions and route only low-confidence ones to a human reviewer — no need to read every page.

Confidence-gated
Mistral, in its own words
“Mistral OCR 4 extracts and structures content from a wide range of documents. Where previous generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document.” That single shift — from a text blob to a typed, located, scored object — is what lets a team build a pipeline that approves itself most of the time.

There is a second, related product worth separating clearly. OCR 4 is the extraction model. Document AI is a Studio product, priced at $5 per 1,000 pages, that wraps OCR 4 with a second-pass mistral-small-2603 call to reshape the output into custom JSON schemas. If your automation needs fields in a fixed shape rather than a generic structured document, Document AI is the mode to evaluate — and its dependence on the smaller model is one reason the Mistral Small model family matters to the document stack.

03 · The Cost Case
The number that actually moves a decision.

Strip away the benchmarks and the durable advantage is price. OCR 4 is $4 per 1,000 pages on the standard API and $2 per 1,000 in batch mode — a 50% discount for non-interactive workloads. Schema-driven Document AI is $5 per 1,000. The table below annualizes published list pricing across three volumes; every Mistral, Azure, and Google cell is the per-1,000-page rate multiplied out by hand.

Annualized document-AI cost by provider and tier at three yearly page volumes, derived from published per-1,000-page list pricing as of June 25, 2026. Mistral, Azure, and Google figures are list prices multiplied by volume. The Baidu Unlimited-OCR row carries no per-page fee; its self-hosted cost is an infrastructure estimate (GPU compute plus operations), not a quoted rate. Pricing changes frequently — verify on each vendor’s pricing page before budgeting.
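| Group | Offering | Per 1,000 pages | 10K pages / yr | 100K pages / yr | 1M pages / yr |
|---|---|---|---|---|---|
| Mistral OCR 4 | OCR 4: Batch API | $2.00 | $20 | $200 | $2,000 |
| Mistral OCR 4 | OCR 4: Standard API | $4.00 | $40 | $400 | $4,000 |
| Mistral OCR 4 | Document AI (schema JSON) | $5.00 | $50 | $500 | $5,000 |
| Hyperscaler document AI | Azure Doc Intelligence: Read | $1.50 | $15 | $150 | $1,500 |
| Hyperscaler document AI | Azure Doc Intelligence: Custom | $30.00 | $300 | $3,000 | $30,000 |
| Hyperscaler document AI | Google Form Parser | ~$30.00 | ~$300 | ~$3,000 | ~$30,000 |
| Open-weight, self-hosted | Baidu Unlimited-OCR | No per-page fee* | Infra only* | Infra only* | Infra only* |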

The 100K-page column is the one to sit with. A firm processing 100,000 pages a year pays $200 in OCR 4 batch mode versus $3,000 on Azure Document Intelligence’s custom extraction tier, a 15x gap that holds at every volume in the table. At the standard $4 API rate the gap is 7.5x. Azure’s Read tier is actually slightly cheaper than OCR 4 batch at $1.50 per 1,000 pages, but it returns text without the bounding boxes and block types that make OCR 4’s output automation-ready.
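
The arithmetic behind the table is simple enough to reproduce. The sketch below multiplies the per-1,000-page list prices quoted in this article by a yearly volume and recomputes the 15x and 7.5x ratios; the dictionary keys and function names are ours, and the rates should be re-checked against each vendor's pricing page.

```python
# Per-1,000-page list prices quoted in this article (USD); verify before budgeting.
RATES = {
    "OCR 4 batch": 2.00,
    "OCR 4 standard": 4.00,
    "Document AI": 5.00,
    "Azure Read": 1.50,
    "Azure Custom": 30.00,
}

def annual_cost(rate_per_1k: float, pages_per_year: int) -> float:
    """List-price cost in USD for a yearly page volume."""
    return rate_per_1k * pages_per_year / 1000

def ratio(expensive: str, cheap: str) -> float:
    """How many times more one offering costs per page than another."""
    return RATES[expensive] / RATES[cheap]

for pages in (10_000, 100_000, 1_000_000):
    row = {name: annual_cost(rate, pages) for name, rate in RATES.items()}
    print(f"{pages:>9,} pages/yr: {row}")

print("Azure Custom vs OCR 4 batch:", ratio("Azure Custom", "OCR 4 batch"))
print("Azure Custom vs OCR 4 standard:", ratio("Azure Custom", "OCR 4 standard"))
```

Because every rate here is a flat per-page price, the ratios are identical at every volume; real contracts with volume tiers or commitments would break that symmetry.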

A caveat on the bottom row. Baidu’s open-weight Unlimited-OCR has no per-page license fee, but “free” is not zero: you pay for GPU compute, deployment, and operations. A precise per-page figure depends on your hardware, utilization, and throughput, so the table marks those cells as an infrastructure estimate rather than a quoted rate. The honest comparison is a managed per-page price against an amortized infrastructure cost you have to model for your own load.
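
One way to model that amortized cost is sketched below. Every input is a placeholder you would replace with measured values for your own hardware and load; the function name, its parameters, and the example numbers are illustrative assumptions, not figures from Baidu, Mistral, or any cloud provider.

```python
def self_hosted_cost_per_1k(gpu_cost_per_hour: float,
                            pages_per_hour: float,
                            utilization: float,
                            ops_cost_per_month: float,
                            pages_per_month: float) -> float:
    """Amortized USD cost per 1,000 pages for a self-hosted model.

    utilization is the fraction of paid GPU time spent doing useful work;
    idle time is still billed, so it lowers the effective throughput.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if pages_per_hour <= 0 or pages_per_month <= 0:
        raise ValueError("page volumes must be positive")
    effective_pages_per_hour = pages_per_hour * utilization
    compute = gpu_cost_per_hour / effective_pages_per_hour * 1000
    ops = ops_cost_per_month / pages_per_month * 1000
    return compute + ops

# Made-up inputs purely to show the calculation.
estimate = self_hosted_cost_per_1k(
    gpu_cost_per_hour=2.0,      # hypothetical GPU rental rate
    pages_per_hour=5000,        # hypothetical measured throughput
    utilization=0.5,            # GPU busy half of the paid time
    ops_cost_per_month=500.0,   # hypothetical engineering and monitoring share
    pages_per_month=1_000_000,
)
MANAGED_BATCH_RATE = 2.00  # OCR 4 batch price quoted in this article
print(f"self-hosted: ${estimate:.2f} per 1,000 pages vs managed ${MANAGED_BATCH_RATE:.2f}")
```

The result is very sensitive to utilization and monthly volume, which is why a per-page figure for a self-hosted model cannot be quoted in the abstract.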

04 · Benchmarks
The scores, read honestly.

Mistral reports a top-line OlmOCRBench score of 85.20 and calls it the “top overall score.” That claim deserves a caveat. The public OlmOCRBench leaderboard — last updated May 21, 2026, before OCR 4’s release — places Infinity-Parser2-Pro at 87.6 and Chandra-2 at 85.9 above it, and VentureBeat independently notes OCR 4 would rank roughly third on the current public board. OCR 4’s 85.20 is a vendor-submitted figure that does not yet appear on the independently reproduced leaderboard.

OlmOCRBench · vendor-stated OCR 4 vs the independent leaderboard

Source: Mistral (OCR 4, vendor-stated); OlmOCRBench public leaderboard via CodeSOTA, last updated May 21, 2026

| Model | Score | Status |
|---|---|---|
| Infinity-Parser2-Pro | 87.6 | Independently reproduced · public leaderboard #1 |
| Chandra-2 | 85.9 | Independently reproduced · public leaderboard |
| Mistral OCR 4 | 85.20 | Vendor-submitted · not yet on the public board |
| Dots.mocr | 83.9 | Independently reproduced · public leaderboard |

The rest of the benchmark story is similarly vendor-framed, and worth keeping in that frame. OlmOCRBench itself, built by the Allen Institute for AI, runs 7,010 unit tests across 1,403 PDFs in seven categories, with per-score uncertainty of roughly a point either way — so small gaps between models are inside the noise. The figures below are Mistral’s own; treat them as directional.
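
A rough way to apply that "inside the noise" caveat to the scores quoted above is to check whether two scores' uncertainty bands overlap. This is a coarse heuristic under the assumption of a flat one-point margin per score, not the benchmark's own statistical procedure.

```python
UNCERTAINTY = 1.0  # assumed margin: "roughly a point either way"

# Scores as quoted in this article.
scores = {
    "Infinity-Parser2-Pro": 87.6,
    "Chandra-2": 85.9,
    "Mistral OCR 4 (vendor-stated)": 85.20,
    "Dots.mocr": 83.9,
}

def within_noise(a: float, b: float, margin: float = UNCERTAINTY) -> bool:
    """True when the two +/- margin bands overlap, so the gap may not be real."""
    return abs(a - b) <= 2 * margin

ocr4 = scores["Mistral OCR 4 (vendor-stated)"]
for name, score in scores.items():
    if name.startswith("Mistral"):
        continue
    verdict = "within noise" if within_noise(score, ocr4) else "clear gap"
    print(f"{name}: {score} vs {ocr4} -> {verdict}")
```

Under this assumption only the gap to the leaderboard leader is clearly outside the margin; the differences against Chandra-2 and Dots.mocr are too small to rank with confidence.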

OmniDocBench
Vendor-stated
93.07

Mistral's reported OmniDocBench score. For context, PaddleOCR-VL-1.6 self-reports 96.33, though that result has not been independently reproduced on the public leaderboard either.

Not third-party verified
Human eval
Average win rate
~72%

Average head-to-head win rate against leading competitors across 600+ real-world documents in 12+ languages, judged by independent annotators Mistral commissioned. The annotators were independent; the study was vendor-run.

Vendor-commissioned
Internal eval
Crawl Multilingual
.98

Mistral's internal multilingual evaluation, reported as leading across all eight language groups. This is an internal benchmark and cannot be independently verified.

Internal · unverifiable
The transparency worth noting
Mistral did something unusual: it published the scoring artifacts it found in OlmOCRBench — ground-truth errors, equivalent LaTeX notation scored as mismatches, column-reading assumptions, header/footer attribution issues — and wrote that it therefore treats the aggregate score as “directional rather than definitive.” Read that as a signal of engineering credibility, and as a template for how to weigh any vendor’s OCR benchmark, not just this one.

What does the 85.20 actually measure? The table below maps OlmOCRBench’s seven categories to the back-office failure each one predicts — the difference between an abstract leaderboard and a decision about whether to trust extraction on your own documents.

The seven OlmOCRBench test categories (from the Allen Institute olmOCR-bench dataset, 7,010 tests across 1,403 PDFs) mapped to what each measures and the back-office failure mode it predicts. The failure-mode column is Digital Applied editorial interpretation, not part of the benchmark.
| Category | What it checks | What failure looks like in your workflow |
|---|---|---|
| arXiv Math | Equation fidelity in LaTeX | A formula in a research report or actuarial model transcribes wrong, silently changing a result. |
| Tables | Row/column structure recovery | An invoice or financial statement loses its grid, so totals land in the wrong fields downstream. |
| Headers / Footers | Boilerplate vs body separation | Page numbers, disclaimers, or letterhead bleed into the extracted body text of a contract. |
| Multi-Column | Reading order across columns | A two-column policy or terms document gets interleaved, scrambling clauses out of sequence. |
| Old Scans | Degraded-image legibility | An archived deed, claim file, or shipping record returns garbled text the pipeline cannot trust. |
| Old Scans Math | Formulas on degraded scans | Both failure modes stack: a faint historical engineering or finance document loses its numbers. |
| Long / Tiny Text | Dense or small-font passages | Fine-print footnotes and dense appendices, exactly where the binding terms hide, drop out. |

05 · Deployment
API, marketplace, or your own jurisdiction.

OCR 4 can be consumed three ways, and the third is the strategic one. Beyond the managed API and the cloud marketplaces, Mistral supports a single-container self-hosted deployment, which lets a regulated enterprise process sensitive documents entirely inside its own infrastructure, with no routing to an external U.S.-jurisdiction cloud API. For organizations weighing the broader tradeoffs, our self-hosted deployment decision guide covers the infrastructure side in depth.

Managed
Mistral API / Studio

Lowest-friction path. Per-page billing, batch discount for non-interactive jobs, and Document AI schema extraction in the same place. Best when data residency is not a hard constraint and you want to move fast.

Fastest to ship
Cloud
SageMaker · Microsoft Foundry

Run OCR 4 inside an existing AWS or Azure footprint, billed through accounts you already govern. Snowflake Parse Document integration is announced as coming soon. Best when you have committed cloud spend and procurement rails.

Inside your cloud
Sovereign
Single-container self-host

Documents never leave your infrastructure, which answers data-residency and sovereignty requirements as the August 2, 2026 application date for the EU AI Act's high-risk obligations approaches. Self-hosting a commercial model is not the same as open weights.

Documents stay home
"At some point, you need to be able to turn it off or turn it on, and you don't want to leave it to another country."
Arthur Mensch, CEO, Mistral AI, on AI sovereignty (London Tech Week, June 2025)
One precise distinction
Self-hosted does not mean open-source. OCR 4 is a commercial API product with an enterprise self-hosting option; the weights are not openly licensed the way a true open-weight model’s are. If open weights are a hard requirement, Baidu’s Unlimited-OCR is the model to look at — not OCR 4 in a container.

06 · The Field
Who OCR 4 is actually competing with.

OCR 4 arrives against established hyperscaler document AI and a fast wave of open-weight models. The most direct counterpoint shipped one day earlier: Baidu’s Unlimited-OCR, a 3-billion-parameter, MIT-licensed model that parses entire PDFs in a single forward pass and gathered roughly 1,800 GitHub stars in its first 24 hours. It is free and self-hosted — and it has no managed API and no enterprise SLA, which is exactly the gap OCR 4’s paid tier fills. Mistral’s own open-weight model lineage is part of why a self-hosting story is even credible from this vendor.

Hyperscaler
Azure Document Intelligence

The incumbent comparison. The Read tier at $1.50 per 1,000 pages slightly undercuts OCR 4's $2 batch price but comes without bounding boxes; the custom extraction tier runs $30 per 1,000, the 15x gap at the top of this post.

Incumbent on Azure
Hyperscaler
Google & Amazon

Google's Form Parser runs ~$30 per 1,000 pages; Amazon Textract is the established AWS option. Deep ecosystem integration, but priced well above OCR 4's per-page economics for structured extraction.

Ecosystem default
Open weight
Baidu Unlimited-OCR

Free, MIT-licensed, self-hosted, 3B params, single-pass PDF parsing. No managed API and no enterprise SLA — you own the deployment and the operations. The DIY counterpoint to a paid managed model.

Free, you run it
Established IDP
ABBYY · Textract incumbents

Mature intelligent-document-processing suites with template libraries and human-in-the-loop tooling built in. Strong on entrenched workflows; the question is per-page cost and how much of the new structured output you'd be re-buying.

Entrenched workflows

07 · Market & Momentum
A land-grab for the ingestion layer.

The pricing makes more sense as strategy than as a margin play. The global intelligent document processing market was about $2.30B in 2024 and is projected to reach $12.35B by 2030 at a 33.1% CAGR, with BFSI the largest segment. OCR 4 feeds directly into Mistral’s Search Toolkit as the ingestion layer for RAG and enterprise search — so winning document extraction is really about owning the front door to every downstream AI workflow.

The financial backdrop fits that ambition. Mistral is targeting €1 billion in revenue for 2026, up from roughly €200 million in 2025, and is reportedly in early discussions to raise about €3 billion at a valuation near €20 billion — nearly double its €11.7 billion Series C from September 2025. No round has been announced as of late June. Pricing OCR 4 to undercut the hyperscalers by an order of magnitude is how you buy share in a market growing at 33% a year, and pairs naturally with Mistral’s broader enterprise AI stack.

Market 2030
IDP market forecast
$12.35B

Up from $2.30B in 2024 at a 33.1% CAGR, per Grand View Research. North America holds 32%+ of the 2024 market and BFSI is the largest end-use segment — the buyers OCR 4's sovereignty story targets.

33.1% CAGR
Revenue
2026 revenue target
€1B

Up from roughly €200M in 2025 — a 5x target (Le Monde, via VentureBeat). OCR 4 and its document-AI pipeline are central to that trajectory, which is why the per-page price is set to win share.

5x vs 2025
Valuation
Reported funding talks
~€20B

Mistral is reportedly in early discussions to raise ~€3B at roughly €20B — nearly double its €11.7B Series C (Sep 2025), when ASML took an 11% stake. No deal has been announced as of June 25.

Early discussions

08 · Putting It to Work
From a structured output to a pipeline that approves itself.

The practical payoff of confidence scores is a pipeline that does not ask a human to read every page. Set a threshold; auto-approve regions above it; route the rest to review. Bounding boxes give the reviewer the exact spot to look, and block types let you apply different rules to tables, signatures, and free text. That is the difference between an OCR tool and a document-automation system — and it is where the customer testimonials, hedged appropriately, point.
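
The gating logic described above can be sketched in a few lines. The `Region` shape, the per-type thresholds, and the sample values are hypothetical; real thresholds should be tuned against a sample of your own documents that a human has already reviewed.

```python
from dataclasses import dataclass

# Hypothetical region shape; not Mistral's actual response schema.
@dataclass
class Region:
    block_type: str
    text: str
    confidence: float  # assumed to lie in 0..1
    page: int

# Hypothetical thresholds: stricter for blocks where an error is costly.
THRESHOLDS = {"table": 0.98, "signature": 0.99}
DEFAULT_THRESHOLD = 0.95

def route(regions: list[Region]) -> tuple[list[Region], list[Region]]:
    """Split regions into (auto_approved, needs_review) by per-type threshold."""
    approved: list[Region] = []
    review: list[Region] = []
    for region in regions:
        threshold = THRESHOLDS.get(region.block_type, DEFAULT_THRESHOLD)
        if region.confidence >= threshold:
            approved.append(region)
        else:
            review.append(region)
    return approved, review

sample = [
    Region("title", "Invoice 0042", 0.99, 1),
    Region("table", "Total 1,250.00", 0.96, 1),
    Region("signature", "J. Doe", 0.995, 2),
    Region("text", "Payment due in 30 days", 0.90, 1),
]
approved, review = route(sample)
print("auto-approved:", [r.block_type for r in approved])
print("needs review:", [r.block_type for r in review])
```

Keeping thresholds per block type means a table cell at 0.96 goes to a reviewer while a title at the same confidence would pass, which mirrors the different cost of an error in each.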

Two of those testimonials are worth quoting with that hedge in mind. Rogo, a financial-AI firm, reported reaching equivalent accuracy at roughly 8x lower cost and 17x lower latency versus leading agentic document parsers; Anaqua, an IP-management firm, reported OCR 4 is roughly 4x faster per page than its incumbent. Both are customer statements on single, undisclosed datasets — directional evidence, not reproduced benchmarks. The right move is to run OCR 4 against your own documents before you commit a forecast to it.
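
Running a model against your own documents can start with something as small as per-field exact-match accuracy on a hand-labeled sample. The sketch below assumes you have already flattened the extraction output into one dictionary of field values per document; the field names and sample values are invented for illustration.

```python
def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial formatting differences are not counted as errors."""
    return " ".join(value.split()).lower()

def field_accuracy(extracted: list[dict[str, str]],
                   truth: list[dict[str, str]]) -> dict[str, float]:
    """Per-field exact-match accuracy over paired documents."""
    if len(extracted) != len(truth):
        raise ValueError("extracted and truth must be paired one-to-one")
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for got, want in zip(extracted, truth):
        for field, expected in want.items():
            totals[field] = totals.get(field, 0) + 1
            # A missing field counts as a miss.
            if normalize(got.get(field, "")) == normalize(expected):
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / count for field, count in totals.items()}

# Invented labeled sample: the second total contains a letter O in place of a zero.
truth = [
    {"invoice_no": "INV-1", "total": "100.00"},
    {"invoice_no": "INV-2", "total": "250.00"},
]
extracted = [
    {"invoice_no": "inv-1 ", "total": "100.00"},
    {"invoice_no": "INV-2", "total": "25O.00"},
]
print(field_accuracy(extracted, truth))
```

A few hundred labeled pages drawn from your real document mix will say more about fit than any leaderboard row, and the same harness can be pointed at each candidate model.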

If you are mapping document automation onto a real back-office process — invoice capture, claims intake, contract review, CRM data entry — the structured output is the input, but the value is in the workflow around it. That scoping work is exactly what our AI digital transformation engagements start with: a confidence-gating design and an honest per-page cost model before any vendor commitment.

High volume, structured
Invoice & form capture

Batch mode at $2 per 1,000 pages plus confidence-gated review is the strongest fit. The 100K-page column in the cost table is this use case: $200 a year in extraction versus thousands on a custom hyperscaler tier.

OCR 4 batch + gating
Fixed-schema output
Structured JSON pipelines

When you need fields in a fixed shape, Document AI at $5 per 1,000 reshapes OCR output into custom schemas via a second-pass model — worth the premium only if the schema step earns it.

Document AI mode
Regulated data
Sovereignty-bound workloads

Single-container self-hosting keeps documents in your jurisdiction as the August 2, 2026 application date for the EU AI Act's high-risk obligations approaches. Model the amortized infrastructure cost against the managed per-page price for your real volume.

Self-host, then measure
Open-weight requirement
No commercial dependency

If open weights are non-negotiable, evaluate Baidu Unlimited-OCR instead and budget for the GPU and ops you'll own. OCR 4 in a container is sovereign, but it is still a commercial license.

Open-weight alternative

09 · Conclusion
A pricing move dressed as a model release.

The shape of document AI, June 2026

Document automation just became a cost question, not a capability one.

Mistral OCR 4 is best understood less as a benchmark winner than as a pricing and packaging move. The structured output — bounding boxes, block types, confidence scores — is genuinely useful and removes integration work teams used to do by hand. The per-page price, at $2 in batch, is what makes large-scale digitization economically boring in the best sense: a 100,000-page archive for $200 stops being a budget conversation.

Keep the benchmark claims in their box. The 85.20 OlmOCRBench, 93.07 OmniDocBench, and ~72% win-rate figures are vendor-stated, and on the independent public leaderboard OCR 4 would sit around third, not first. Mistral’s own “directional rather than definitive” framing is the right posture to borrow — and the reason to run the model against your own documents rather than trust the headline.

The forward read is straightforward. With Baidu shipping a free open-weight parser the day before and Mistral pricing a managed model at an order-of-magnitude discount to the hyperscalers, the margin in raw extraction is compressing fast. The value is migrating to the workflow above it — confidence-gating, schema design, and the sovereignty wrapper — and to whoever owns the ingestion layer feeding every downstream RAG and search pipeline. That, not a leaderboard row, is what OCR 4 is really competing for.

FAQ · Mistral OCR 4

The questions we get every week.

What is Mistral OCR 4?

Mistral OCR 4 is a document-AI model released on June 23, 2026 — Mistral's fourth OCR generation in roughly 15 months. Rather than converting a page into flat text, it returns a structured representation of the document: bounding boxes for every element, block types (title, table, equation, signature, and others), and page- and word-level confidence scores. It accepts PDF, DOC, PPT, and OpenDocument files directly and reports coverage across 170 languages in 10 language groups. It is available through the Mistral API and Studio, Amazon SageMaker, and Microsoft Foundry at launch, with Snowflake Parse Document integration announced as coming soon.